Sphinx-3 FAQ

Rita Singh
Sphinx Speech Group
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

This document is constantly under construction. You will find the most up-to-date version of this document here.

INDEX
  1. Data-preparation for training acoustic models
  2. Selecting modeling parameters
  3. Feature computation
  4. Modeling filled pauses and non-speech events
  5. Training speed
  6. Questions specific to log files
  7. Vector-quantization for discrete and semi-continuous models
  8. Flat-initialization
  9. Updating existing models
  10. Utterance, word and phone segmentations
  11. Force-alignment (Viterbi alignment)
  12. Baum-Welch iterations and associated likelihoods
  13. Dictionaries, pronunciations and phone-sets
  14. Decision tree building and parameter sharing
  15. Post-training disasters
  16. Why is my recognition accuracy poor?
  17. Interpreting SPHINX-II file formats
  18. Interpreting SPHINX-III file formats
  19. Hypothesis combination
  20. Language model
  21. Training context-dependent models with untied states
  22. Acoustic likelihoods and scores
  23. Decoding problems
  24. (Added on 20040910 by Arthur Chan) Why is Sphinx III's performance poorer than that of recognizer X?

DATA-PREPARATION FOR TRAINING ACOUSTIC MODELS

Q. How can I tell the size of my speech corpus in hours? Can I use it all for training?

A. You can only train with utterances for which you have transcripts. You cannot usually tell the size of your corpus from the number of utterances you have. Sometimes utterances are very long, and at other times they may be as short as a single word or sound. The best way to estimate the size of your corpus in hours is to look at the total size in bytes of all utterance files which you can use to train your models. Speech data are usually stored in integer format. Assuming that is so and ignoring any headers that your file might have, an approximate estimate of the size of your corpus in hours can be obtained from the following parameters of the speech data:
 
Sampling Rate: If this is S kHz, then there are S x 1000 samples (integers) in every second of your data.
Sample Size: If your sampling is "8bit", every integer occupies 1 byte. If it is "16bit", every integer in your data occupies 2 bytes.
Hour Size: There are 3600 seconds in an hour.

Here's a quick reference table:
 
No. of bytes   Sampling rate   Sample size   Hours of data
X              8 kHz           8 bit         X / (8000*1*3600)
X              16 kHz          16 bit        X / (16000*2*3600)
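
The arithmetic in the table can be wrapped in a small helper. This is only a sketch: the function name is ours, and it assumes single-channel, headerless integer data as described above.

```python
def corpus_hours(total_bytes, sampling_rate_hz, bits_per_sample):
    """Approximate hours of audio in headerless, single-channel integer files."""
    bytes_per_sample = bits_per_sample // 8
    bytes_per_hour = sampling_rate_hz * bytes_per_sample * 3600
    return total_bytes / bytes_per_hour


# 1,382,400,000 bytes of 16 kHz, 16-bit speech
print(corpus_hours(1_382_400_000, 16000, 16))  # 12.0
```

For stereo recordings the byte count per second doubles, so the result would have to be halved.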

Q: I have about 7000 utterances or 12 hours of speech in my training set. I found force-aligned transcripts for all but 114 utterances, and those 114 utterances have no transcripts that I can find. Should I leave them out of the training? I don't think it will make that much difference as it's 1.4% of the data. Also, for this much data, how many senones should I use?

A: Leave out utterances for which you don't have transcripts (unless you have very little data in the first place, in which case listen to the audio and transcribe it yourself). In this case, just leave them out.

Rule-of-thumb figures for the number of senones that you should be training are given in the following table:
 

Amount of training data (hours)   No. of senones
1-3                               500-1000
4-6                               1000-2500
6-8                               2500-4000
8-10                              4000-5000
10-30                             5000-5500
30-60                             5500-6000
60-100                            6000-8000
Greater than 100                  8000 are enough

Q: What is force-alignment? Should I force-align my transcripts before I train?

A: The process of force-alignment takes an existing transcript, and finds out which, among the many pronunciations for the words occurring in the transcript, are the correct pronunciations. So when you refer to "force-aligned" transcripts, you are also inevitably referring to a *dictionary* with reference to which the transcripts have been force-aligned. So if you have two dictionaries and one has the word "PANDA" listed as:
 
PANDA P AA N D AA
PANDA(2) P AE N D AA
PANDA(3) P AA N D AX

and the other one has the same word listed as
 
PANDA P AE N D AA
PANDA(2) P AA N D AX
PANDA(3) P AA N D AA

and you force-align using the first dictionary so that your transcript looks like:
I  SAW  A  PANDA(3)  BEAR,
then if you train with that transcript but use the second dictionary, you would be giving the wrong pronunciation to the trainer. You would be telling the trainer that the pronunciation for the word PANDA in your corpus is "P AA N D AA" instead of the correct one, which should have been "P AA N D AX". The data corresponding to the phone AX will now be wrongly used to train the phone AA.
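
The PANDA example can be checked mechanically. The sketch below uses helper names of our own (they are not part of the Sphinx tools): it loads the two dictionaries and reports any pronunciation marker in the transcript that resolves to different phones in each.

```python
def load_dictionary(lines):
    """Map each dictionary entry (e.g. 'PANDA(3)') to its phone sequence."""
    entries = {}
    for line in lines:
        fields = line.split()
        if fields:
            entries[fields[0]] = fields[1:]
    return entries


first_dict = load_dictionary([
    "PANDA P AA N D AA",
    "PANDA(2) P AE N D AA",
    "PANDA(3) P AA N D AX",
])
second_dict = load_dictionary([
    "PANDA P AE N D AA",
    "PANDA(2) P AA N D AX",
    "PANDA(3) P AA N D AA",
])


def mismatched_words(transcript, aligned_with, trained_with):
    """Return transcript words whose phones differ between the two dictionaries."""
    return [word for word in transcript.split()
            if word in aligned_with
            and aligned_with[word] != trained_with.get(word)]


print(mismatched_words("I SAW A PANDA(3) BEAR", first_dict, second_dict))
# ['PANDA(3)']
```

Words that are missing from the first dictionary are skipped here, since this sketch only looks for entries that exist in both dictionaries but disagree.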

What you must really do is to collect your transcripts, use only the first listed pronunciation in your training dictionary, train ci models, and use *those ci models* to force-align your transcripts against the training dictionary. Then go all the way back and re-train your ci models with the new transcripts.
 

Q: I don't have transcripts. How can I force-align?

A: You cannot force-align any transcript that you do not have.
 

Q: I am going to first train a set of coarse models to force-align the transcripts. So I should submit beginning and end silence marked transcripts to the trainer for the coarse models. Currently I am keeping all the fillers, such as UM, BREATH, NOISE etc. in my transcripts, but wrapped with "+". Do you think the trainer will consider them as fillers instead of normal words?

A: According to the trainer, ANY word listed in the dictionary in terms of any phone/sequence of phones is a valid word. BUT the decision tree builder treats any +word+ phone as a noise phone and does not build decision trees for it. So while training, mark the fillers as ++anything++ in the transcript and make sure that either the filler dictionary or the main dictionary has a mapping of the form

++anything++ +something+

where +something+ is a phone listed in your phonelist.

Q: I have a huge collection of data recorded under different conditions. I would like to train good speaker-independent models using this (or a subset of this) data. How should I select my data? I also suspect that some of the transcriptions are not very accurate, but I can't figure out which ones are inaccurate without listening to all the data.

A. If the broad acoustic conditions are similar (for example, if all your data has been recorded off TV shows), it is best to use all the data you can get for training speaker-independent, bandwidth-independent, gender-independent models. If you suspect that some of the data you are using might be bad for some reason, then during the Baum-Welch iterations you can monitor the likelihoods corresponding to each utterance and discard the utterances with really low likelihoods. This would filter out the acoustically bad or badly transcribed data.
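
One possible way to automate the filtering is sketched below. The per-frame normalization and the cutoff (1.5 standard deviations below the corpus mean) are our assumptions, not something the trainer prescribes; the utterance likelihoods are assumed to have been collected from the Baum-Welch logs beforehand.

```python
from statistics import mean, stdev


def low_likelihood_utterances(per_frame_loglik, num_std=1.5):
    """Return IDs of utterances whose per-frame log-likelihood is far below the mean.

    per_frame_loglik maps an utterance ID to its log-likelihood divided by
    its number of frames, so that long and short utterances are comparable.
    """
    values = list(per_frame_loglik.values())
    cutoff = mean(values) - num_std * stdev(values)
    return sorted(utt for utt, ll in per_frame_loglik.items() if ll < cutoff)


scores = {"u1": -5.0, "u2": -5.2, "u3": -4.9, "u4": -5.1, "u5": -5.0, "u6": -30.0}
print(low_likelihood_utterances(scores))  # ['u6']
```

It is worth listening to a few of the rejected utterances before discarding them, since the right cutoff depends on the corpus.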

Q: What is the purpose of the 4th field in the control file:

newfe/mfcc/sw02001 24915 25019 /phase1/disc01/sw2001.txt_a_249.15_250.19

Should I leave the /phase1/disc01... as is or should it be formatted differently? I'm not sure where/why this field is used so I can't make a good guess as to what it should be.

A. The fourth field in the control file is simply an utterance identifier. So long as that field and the entry at the end of the corresponding utterance in the transcript file are the same, you can have anything written there and the training will go through. It is only a very convenient tag. The particular format that you see for the fourth field is just an "informative" way of tagging. Usually we use file paths and names along with other file attributes that are of interest to us.
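
A quick consistency check between the two files might look like the sketch below. The function names are ours, and it assumes that the tag at the end of each transcript line is a single token wrapped in parentheses; adjust the parsing if your transcripts are tagged differently.

```python
def control_utt_id(ctl_line):
    """Return the fourth field (utterance identifier) of a control file line."""
    fields = ctl_line.split()
    if len(fields) < 4:
        raise ValueError("expected 4 fields: path, start, end, utterance id")
    return fields[3]


def transcript_utt_id(transcript_line):
    """Return the tag at the end of a transcript line, or None if there is none.

    Assumption: the tag is the last token and is wrapped in parentheses.
    """
    tokens = transcript_line.split()
    if tokens and tokens[-1].startswith("(") and tokens[-1].endswith(")"):
        return tokens[-1][1:-1]
    return None


def mismatched_entries(ctl_lines, transcript_lines):
    """List (line number, control id, transcript id) for pairs that disagree."""
    problems = []
    for n, (ctl, trans) in enumerate(zip(ctl_lines, transcript_lines), start=1):
        ctl_id, trans_id = control_utt_id(ctl), transcript_utt_id(trans)
        if ctl_id != trans_id:
            problems.append((n, ctl_id, trans_id))
    return problems


ctl = ["newfe/mfcc/sw02001 24915 25019 /phase1/disc01/sw2001.txt_a_249.15_250.19"]
good = ["HELLO THERE (/phase1/disc01/sw2001.txt_a_249.15_250.19)"]
print(mismatched_entries(ctl, good))  # []
```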

Q: I am trying to train with Switchboard data. Switchboard data is mu-law encoded. Do we have generic tools for converting from stereo mu-law to a standard raw file?

A. NIST provides a tool called w_edit which lets you specify the output format, the desired channel to decode and the beginning and ending sample that you would like decoded. ch_wave, a part of the Edinburgh speech tools, does this decoding as well (send mail to awb@cs.cmu.edu for more information on this). Here is a conversion table for converting 8 bit mu-law to 16 bit PCM. The usage should be clear from the table: linear_value = linear[mu_law_value]; (i.e. if your mu-law value is 16, the PCM value is linear[16]).


------------- mu-law to PCM conversion table-----------------------
  static short int linear[256] = {-32124, -31100, -30076, -29052,
 -28028, -27004, -25980, -24956, -23932, -22908, -21884, -20860,
 -19836, -18812, -17788, -16764, -15996, -15484, -14972, -14460,
 -13948, -13436, -12924, -12412, -11900, -11388, -10876, -10364,
 -9852, -9340, -8828, -8316, -7932, -7676, -7420, -7164, -6908,
 -6652, -6396, -6140, -5884, -5628, -5372, -5116, -4860, -4604,
 -4348, -4092, -3900, -3772, -3644, -3516, -3388, -3260, -3132,
 -3004, -2876, -2748, -2620, -2492, -2364, -2236, -2108, -1980,
 -1884, -1820, -1756, -1692, -1628, -1564, -1500, -1436, -1372,
 -1308, -1244, -1180, -1116, -1052, -988, -924, -876, -844, -812,
 -780, -748, -716, -684, -652, -620, -588, -556, -524, -492, -460,
 -428, -396, -372, -356, -340, -324, -308, -292, -276, -260, -244,
 -228, -212, -196, -180, -164, -148, -132, -120, -112, -104, -96,
 -88, -80, -72, -64, -56, -48, -40, -32, -24, -16, -8, 0, 32124,
  31100, 30076, 29052, 28028, 27004, 25980, 24956, 23932, 22908,
  21884, 20860, 19836, 18812, 17788, 16764, 15996, 15484, 14972,
  14460, 13948, 13436, 12924, 12412, 11900, 11388, 10876, 10364,
  9852, 9340, 8828, 8316, 7932, 7676, 7420, 7164, 6908, 6652, 6396,
  6140, 5884, 5628, 5372, 5116, 4860, 4604, 4348, 4092, 3900, 3772,
  3644, 3516, 3388, 3260, 3132, 3004, 2876, 2748, 2620, 2492, 2364,
  2236, 2108, 1980, 1884, 1820, 1756, 1692, 1628, 1564, 1500, 1436,
  1372, 1308, 1244, 1180, 1116, 1052, 988, 924, 876, 844, 812,
  780, 748, 716, 684, 652, 620, 588, 556, 524, 492, 460, 428, 396,
  372, 356, 340, 324, 308, 292, 276, 260, 244, 228, 212, 196, 180,
  164, 148, 132, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40,
  32, 24, 16, 8, 0};
------------- mu-law to PCM conversion table-----------------------
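
The entries in the table appear to follow the standard G.711 mu-law expansion, so the same values can be computed instead of looked up. The sketch below is ours (the function name is not from any Sphinx or NIST tool) and takes one 8-bit mu-law value at a time.

```python
def ulaw_to_linear(mu_law_value):
    """Expand one 8-bit mu-law value (0..255) to a 16-bit linear PCM sample."""
    u = ~mu_law_value & 0xFF          # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


print(ulaw_to_linear(16))   # -15996, same as linear[16] in the table
print(ulaw_to_linear(128))  # 32124, same as linear[128]
```

For a stereo file the two channels still have to be separated before or after this expansion.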


SELECTING MODELING PARAMETERS

Q: How many senones should I train?

A: Rule-of-thumb figures for the number of senones that you should be training are given in the following table:

Amount of training data (hours)   No. of senones
1-3                               500-1000
4-6                               1000-2500
6-8                               2500-4000
8-10                              4000-5000
10-30                             5000-5500
30-60                             5500-6000
60-100                            6000-8000
Greater than 100                  8000 are enough

Q: How many states-per-hmm should I specify for my training?

A: If you have "difficult" speech (noisy/spontaneous/damaged), use 3-state HMMs with a noskip topology. For clean speech you may choose to use any odd number of states, depending on the amount of data you have and the type of acoustic units you are training. If you are training word models, for example, you might be better off using 5 states or more. 3-5 states are good for shorter acoustic units like phones. You cannot currently train 1-state HMMs with Sphinx.

Remember that the topology is also related to the frame rate and the minimum expected duration of your basic sound units. For example the phoneme "T" rarely lasts more than 10-15 ms. If your frame rate is 100 frames per second, "T" will therefore be represented in no more than 3 frames. If you use a 5 state noskip topology, this would force the recognizer to use at least 5 frames to model the phone. Even a 7 state topology that permits skips between alternate states would force the recognizer to visit at least 4 of these states, thereby requiring the phone to be at least 4 frames long. Both would be erroneous. Give this point very serious thought before you decide on your HMM topology. If you are not convinced, send us mail and we'll help you out.
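
The minimum-duration argument can be put into numbers. The sketch below is ours and assumes a left-to-right HMM that must start in its first state and end in its last, consuming one frame per state visited; max_jump=1 means no skips and max_jump=2 means skips between alternate states.

```python
import math


def min_frames(num_states, max_jump=1):
    """Fewest frames needed to traverse a left-to-right HMM from first to last state."""
    return 1 + math.ceil((num_states - 1) / max_jump)


def min_duration_ms(num_states, max_jump=1, frames_per_second=100):
    """Shortest sound (in milliseconds) that the topology can represent."""
    return min_frames(num_states, max_jump) * 1000.0 / frames_per_second


print(min_frames(5))              # 5 frames for a 5-state noskip topology
print(min_frames(7, max_jump=2))  # 4 frames for a 7-state topology with skips
print(min_duration_ms(5))         # 50.0 ms at 100 frames per second
```

Comparing these minimum durations with the shortest sounds you expect in your data is a quick sanity check on a proposed topology.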

Q: I have two sets of models, A and B. The set A has been trained with 10,000 tied states (or senones) and B has been trained with 5,000 senones. If I want to compare the recognition results on a third database using A and B, does this difference in the number of senones matter?

A. If A and B have been optimally trained (i.e. the amount of data available for training each has been well considered), then the difference in the number of tied states used should not matter.


TRAINING SPEED


Q: I am trying to train models on a single machine. I just want to train a set of coarse models for forced-alignment. The baum-welch iterations are very slow. In 24 hours, it has only gone through 800 utterances. I have a total of 16,000 utterances. At this speed, it will take 20 days for the first iteration of baum-welch. Considering the convergence ratio to be 0.01, it will take several months to obtain the first CI-HMM, let alone CD-HMM. Is there any way to speed this up?

A: If you start from flat-initialized models, the first two iterations of Baum-Welch will always be very slow. This is because all paths through the utterance are similar and the algorithm has to consider all of them. In the higher iterations, when the various state distributions begin to differ from each other, the computation speeds up a great deal.

Given the observed speed of your machine, you cannot possibly hope to train your models on a single machine. You may think of assigning a lower value to the "topn" argument of the bw executable, but since you are training CI models, changing the topn value from its default (99) to any smaller number will not affect the speed, since there is only at best 1 Gaussian per state anyway throughout the computation.

Try to get more machines to share the jobs. There is a -npart option to help you partition your training data. Alternatively, you can shorten your training set, since you only want to use the models for forced alignment. Models trained with about 10 hours of data will do the job just as well.
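
The questioner's 20-day figure, and the effect of adding machines, is simple arithmetic. The sketch below is ours and assumes the data is split evenly across machines of equal speed, which is optimistic; it says nothing about how the -npart option itself divides the data.

```python
def days_for_one_iteration(total_utts, utts_done, hours_elapsed, num_machines=1):
    """Estimate the days needed for one pass over the data at the observed speed."""
    total_hours = total_utts * hours_elapsed / utts_done
    return total_hours / 24.0 / num_machines


print(days_for_one_iteration(16000, 800, 24))                   # 20.0
print(days_for_one_iteration(16000, 800, 24, num_machines=10))  # 2.0
```

As noted above, later iterations run faster than the first two, so this is an upper estimate for the training as a whole.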

POST-TRAINING DISASTERS

Q: I've trained with clean speech. However, when I try to decode noisy speech with my models, the decoder just dies. Shouldn't it give at least some junk hypothesis?

A. Adding noise to the test data increases the mismatch between the models and test data. So if the models are not really well trained (and hence not very generalizable to slightly different data), the decoder dies. There are multiple reasons for this; they are described in the answer to the next question.

One way to solve this problem is just to retrain with noisy data.

Q: I've trained models but I am not able to decode. The decoder settings seem to be ok. It just dies when I try to decode.

A. If all flag settings are fine, then the decoder is probably dying because the acoustic models are bad. There can be multiple reasons for this: a) All paths that lead to a valid termination may get pruned out. b) The likelihood of the data may be so poor that the decoder goes into underflow. This happens even if *only one* of your models is very badly trained. The likelihood of this one model becomes very small, and because the decoder uses integer arithmetic the resulting low likelihood gets inverted to a very large positive number, which results in segmentation errors, arithmetic errors, etc.
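
The inversion described in (b) is ordinary integer wrap-around. The sketch below imitates it in Python, assuming 32-bit signed scores; the actual integer width used by the decoder is not stated here, and the numbers are made up for illustration.

```python
def wrap_int32(value):
    """Reduce a Python int to the value a 32-bit signed integer would hold."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


INT32_MIN = -2**31
score = INT32_MIN + 10           # already a very poor (very negative) score
score = wrap_int32(score - 100)  # one more bad frame pushes it past the limit
print(score)                     # 2147483558
```

A score that should have been the worst possible now looks like the best possible, which is why the downstream computations fail in unpredictable ways.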

You'll probably have to retrain the models in a better way. Force-align properly, make sure that all phones and triphones that you do train are well represented in your training data, use more data for training if you can, check your dictionaries and use correct pronunciations, etc.

Q: I started from one set of models, and trained further using another bunch of data. This data looked more like my test data, and there was a fair amount of it. So my models should have improved. When I use these models for recognition, however, the performance of the system is awful. What went wrong?

A: The settings used to train your base models may have differed in one or more ways from the settings you used while training with the new data. The most dangerous setting mismatch is in the agc (max/none). Check the other settings too, and finally make sure that during decoding you use the same agc (and other relevant settings like varnorm and cmn) as during training.


QUESTIONS SPECIFIC TO LOG-FILE OUTPUTS

Q. My decode log file gives the following message:

ERROR: "../feat.c", line 205: Header size field: -1466929664(a8906e00); filesize: 1(00000001)
================
exited with status 0

A. The feature files are byte swapped!
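
The header value in the log can be checked by hand: reversing the bytes of a8906e00 gives 006e90a8, a plausible positive count. The sketch below does the same in code and also guesses the byte order of a feature file. It assumes that the file starts with a 4-byte integer giving the number of 4-byte values that follow; the function names are ours.

```python
import struct


def swap32(value):
    """Reverse the byte order of a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack(">i", value))[0]


def header_byte_order(data):
    """Guess the byte order of a feature file from its 4-byte header.

    Assumption: the header holds the number of 4-byte values that follow it.
    """
    expected = (len(data) - 4) // 4
    (little,) = struct.unpack("<i", data[:4])
    (big,) = struct.unpack(">i", data[:4])
    if little == expected:
        return "little-endian"
    if big == expected:
        return "big-endian"
    return "unknown"


print(swap32(-1466929664))  # 7245992

# A tiny big-endian file: header says 26 values, followed by 26 floats
sample = struct.pack(">i", 26) + struct.pack(">26f", *([0.0] * 26))
print(header_byte_order(sample))  # big-endian
```

If the reported order differs from that of the machine you are running on (sys.byteorder), the files need to be byte swapped.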

Q. During force-alignment, the log file has many messages which say "Final state not reached" and the corresponding transcripts do not get force-aligned. What's wrong?

A. The message means that the utterance likelihood was very low, meaning in turn that the sequence of words in your transcript for the corresponding feature file given to the force-aligner is rather unlikely. The most common reasons are that you may have the wrong model settings or the transcripts being considered may be inaccurate. For more on this, see the section on force-alignment (Viterbi alignment).

Q. I am trying to do flat-initialization for training ci models. The cp_parm program is complaining about the -feat option. The original script did not specify a -feat option, however the cp_parm program complained that the default option was unimplemented. I've made several attempts at specifying a -feat option with no luck. Below is the output of two runs. Can you give me an idea of what is happening here?

Default (no -feat passed) produces:

-feat     c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
ERROR: "../feat.c", line 121: Unimplemented feature
c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
ERROR: "../feat.c", line 122: Implemented features are:
        c/1..L-1/,d/1..L-1/,c/0/d/0/dd/0/,dd/1..L-1/
        c/1..L-1/d/1..L-1/c/0/d/0/dd/0/dd/1..L-1/
        c/0..L-1/d/0..L-1/dd/0..L-1/
        c/0..L-1/d/0..L-1/
INFO: ../s3gau_io.c(128): Read
[path]/model_parameters/new_fe.ci_continuous_flatinitial
/globalmean [1x1x1 array]
gau 0 <= 0
gau 1 <= 0
gau 2 <= 0

This is the error message if I attempt to specify the -feat option:

 -feat     c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
....

ERROR: "../feat.c", line 121: Unimplemented feature
c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
ERROR: "../feat.c", line 122: Implemented features are:
        c/1..L-1/,d/1..L-1/,c/0/d/0/dd/0/,dd/1..L-1/
        c/1..L-1/d/1..L-1/c/0/d/0/dd/0/dd/1..L-1/
        c/0..L-1/d/0..L-1/dd/0..L-1/
        c/0..L-1/d/0..L-1/

A. The last three lines in the case where you do not specify the -feat option say that cp_parm is going through and that the mean vector labelled "0" is being copied to state 0, state 1, state 2.... The same "0" vector is being copied because this is a flat-initialization, where all means, variances etc. are given equal flat values. At this point, these errors in the log files can just be ignored.

Q. I am trying to make linguistic questions for state tying. The program keeps failing because it can't allocate enough memory. Our machines are rather large with 512MB and 1 to 2 GB swap space. Does it make sense that it really doesn't have enough memory, or is it more likely something else failed? Below is the log from this program.

 -varfn
[path]/model_parameters/new_fe.ci_continuous/variances \
 -mixwfn
[path]/model_parameters/new_fe.ci_continuous/mixture_weights \
 -npermute 168 \
 -niter 0 \
 -qstperstt 20 \
.....
.....
.....
INFO: ../s3gau_io.c(128): Read
/sphx_train/hub97/training/model_parameters/new_fe.ci_continuous/means
[153x1x1 array]
INFO: ../s3gau_io.c(128): Read
/sphx_train/hub97/training/model_parameters/new_fe.ci_continuous/variances
[153x1x1 array]
FATAL_ERROR: "../ckd_alloc.c", line 109: ckd_calloc_2d failed for caller at
../main.c(186) at ../ckd_alloc.c(110)

A. make_quests searches 2^npermute combinations several times for the optimal clustering of states. For this, it has to store 2^npermute values (for the comparison). So, setting -npermute to anything greater than 8 or 10 makes the program very slow, and anything over 28 will make the program fail. We usually use a value of 8.
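
To see why the value 168 in the log above cannot work, it helps to print how fast 2^npermute grows. The sketch below is ours; the 4 bytes per stored value is an assumption made only to give a feel for the memory involved.

```python
def combinations_to_store(npermute):
    """Number of values that have to be stored for a given -npermute."""
    return 2 ** npermute


def table_bytes(npermute, bytes_per_value=4):
    """Rough memory needed, assuming bytes_per_value bytes per stored value."""
    return combinations_to_store(npermute) * bytes_per_value


for n in (8, 10, 28, 168):
    print(n, combinations_to_store(n), table_bytes(n))
```

With these assumptions, -npermute 28 already needs 1 GB, and 168 is far beyond any machine.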

Q. I'm getting a message about end of data beyond end of file from agg_seg during vector-quantization. I assume this means the .ctl file references a set of data beyond the end of the file. Should I ignore this?

A. Yes, for agg_seg, if it is going through in spite of the message. agg_seg only collects samples of feature vectors to use for quantization through kmeans. No, for the rest of the training, because it may cause random problems. The entry in the control file and the corresponding transcript have to be removed if you cannot correct them for some reason.


VECTOR-QUANTIZATION FOR DISCRETE AND SEMI-CONTINUOUS MODELS

Q. I have a question about VQ. When you look at the 39-dimensional [cep + d-cep + dd-cep ] vector, it's clear that each part (cep, d-cep, dd-cep) will have quite a different dynamic range and different mean. How should we account for this when doing DISCRETE HMM modeling? Should we make a separate codebook for each? If so, how should we "recombine" when recognizing? Or should we rescale the d-cep and dd-cep up so they can "compete" with the "larger" cep numbers in contributing to the overall VQ?

In other words, suppose we want to train a complete discrete HMM system - is there a way to incorporate the d-cep and dd-cep features into the system to take advantage of their added information? If we just concatenate them all into one long vector and do standard VQ, the d-cep and dd-cep won't have much of an influence as to which VQ codebook entry matches best an incoming vector. Perhaps we need to scale up the d-cep and dd-cep features so they have the same dynamic range as the cep features? Is there a general strategy that people have done in the past to make this work? Or do we have to "bite the bullet" and move up to semi-continuous HMM modeling?

A: You *could* add d-cep and dd-cep with the cepstra into one long feature. However, this is always inferior to modeling them as separate feature streams (unless you use codebooks with many thousand codewords).

Secondly, for any cepstral vector, the dynamic range and value of c[12], for example, is much smaller (by orders of magnitude) than c[1] and doesn't affect the quantization at all. In fact, almost all the quantization is done on the basis of the first few cepstra with the largest dynamic ranges. This does not affect system performance in a big way. One of the reasons is that the classification information in the features that do not affect VQ much is also not too great.

However, if you do really want to be careful with dynamic ranges, you could perform VQ using Mahalanobis distances instead of Euclidean distances. In the Mahalanobis distance each dimension is weighted by the inverse of the standard deviation of that component of the data vectors, e.g. c[12] would be weighted by (1/std_dev(c[12])). The standard deviations could be computed either over the entire data set (based on the global variance) or on a per-cluster basis (you use the standard deviation of each of the clusters you obtain during VQ to weight the distance from the mean of that cluster). Each of these two has a slightly different philosophy, and could give slightly different results.
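
A minimal sketch of the weighted distance, using global standard deviations, is given below. The function names and the toy numbers are ours; the second dimension plays the role of a component with a small dynamic range, such as c[12].

```python
import math


def weighted_distance(vector, codeword, std_devs):
    """Distance in which each dimension is scaled by 1/std_dev of that dimension."""
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(vector, codeword, std_devs)))


def nearest_codeword(vector, codebook, std_devs):
    """Index of the codeword closest to the vector under the weighted distance."""
    return min(range(len(codebook)),
               key=lambda i: weighted_distance(vector, codebook[i], std_devs))


vector = [10.0, 0.10]
codebook = [[12.0, 0.10], [10.5, 0.40]]

print(nearest_codeword(vector, codebook, [1.0, 1.0]))   # 1 (plain Euclidean)
print(nearest_codeword(vector, codebook, [10.0, 0.1]))  # 0 (weighted)
```

With unit weights the first dimension decides the match. Once each dimension is scaled by its own standard deviation, the small-range dimension gets a say and the other codeword wins.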

A third thing you could do is to compute a Gaussian mixture with your data, and classify each data vector (or extended data vector, if you prefer to combine cepstra/dcep/ddcep into a single vector) as belonging to one of your gaussians. You then use the mean of that Gaussian as the codeword representing that vector. Dynamic ranges of data will not be an issue at all in this case.

Note: In Sphinx, for semi-continuous modeling, a separate codebook is made for each of the four feature streams: 12c, 24d, 3energy, 12dd. Throughout the training, the four streams are handled independently of each other and so in the end we have four sets of mixture weights corresponding to each senone or HMM state. Sphinx does not do discrete modeling directly.

Q. For vector-quantization, should the control file entries correspond exactly to the transcript file entries?

A. For the vq, the order of files in the ctl need not match the order of transcripts. However, for the rest of the training, the way our system binaries are configured, there has to be an exact match. The vq does not look at the transcript file. It just groups data vectors (which are considered without reference to the transcripts).

Q. What is the difference between the -stride flag in agg_seg and kmeans-init?

A. -stride in agg_seg samples the feature vectors at every stride-th interval; the sampled vectors are then used for VQ. In the kmeans-init program its function is the same, but this time it operates on the vectors already accumulated by agg_seg, so we usually set it to 1.
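
In code, sampling with a stride amounts to keeping one vector out of every "stride". The sketch below is only an illustration; whether the tools start counting at the first vector is our assumption.

```python
def sample_with_stride(vectors, stride):
    """Keep every stride-th feature vector, starting with the first one."""
    return vectors[::stride]


frames = list(range(10))              # stand-ins for 10 feature vectors
print(sample_with_stride(frames, 3))  # [0, 3, 6, 9]
print(sample_with_stride(frames, 1))  # all 10 vectors are kept
```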

Q. Regarding the size of the VQ Codebook: is there something to say that the size 256 is optimal? Would increasing the size affect the speed of decoding?

A. For more diverse acoustic environments, having a larger codebook size would result in better models and better recognition. We have been using 256 codewords primarily for use with the SPHINX-II decoder, since for historical reasons it does not handle larger codebook sizes. The original SPHINX-II used a single-byte integer to index the codewords. The largest number of codewords possible was therefore 256. The format conversion code which converts models from SPHINX-III format to SPHINX-II format accordingly requires that your models be trained with a codebook size of 256.

The standard Sphinx-III decoder, however, can handle larger codebooks. Increasing the codebook size would slow down decoding, since the number of mixture-weights would be higher for each HMM state.

Q. I am trying to do VQ. It just doesn't go through. What could be wrong?

A. It's hard to say without looking at the log files. If a log file is not being generated, check for machine/path problems. If it is being generated, here are the common causes you can check for:

  1. byte-swap of the feature files
  2. negative file lengths
  3. bad vectors in the feature files, such as those computed from headers
  4. the presence of very short files (a vector or two long)


UPDATING EXISTING MODELS

Q. I have 16 gaussian/state continuous models, which took a lot of time to train. Now I have some more data and would like to update the models. Should I train all over again starting with the tied mdef file (the trees)?

A. Training from the trees up to 16 or 32 gaussians per state takes a lot of time. If you have more data from the same domain or thereabouts, and just want to update your acoustic models, then you are probably better off starting with the current 16 or 32 gaussians/state models and running a few iterations of Baum-Welch from there on with *all* the data you have. While there would probably be some improvement if you started from the trees, I don't think it would be very different from iterating from the current models. You *would* get better models if you actually built the trees all over again using all the data you have (since they would now consider more triphones), but that would take a long time.

Q. I have a set of models A, which have a few filler phones. I want to use additional data from another corpus to adapt the model set A to get a new adapted model set B. However, the corpus for B has many other filler phones which are not the same as the filler models in set A. What do I do to be able to adapt?

A. Edit the filler dictionary and insert the fillers you want to train. Map each filler in B to a filler phone (or a sequence of phones) in model set A. For example:

++UM++       AX M
++CLICK++    +SMACK+
++POP++      +SMACK+
++HMM++      HH M
++BREATH++   +INHALE+
++RUSTLE++   +INHALE+

On the LHS, list the fillers in B. On the RHS, put in the corresponding fillers (or phones) in A. In this case, it will be a many-to-one mapping from B to A.

To force-align, add the above filler transcriptions to the *main* dictionary used to force-align.


UTTERANCE, WORD AND PHONE SEGMENTATIONS

Q. How do I use the sphinx-3 decoder to get phone segmentations?

A. The decoder works at the sentence level and outputs word level segmentations. If your "words" are phones, you have a phone-decoder and you can use the -matchsegfn flag to write the phone segmentations into a file. If your words are not phones (and are proper words), then write out matchseg files (using the -matchsegfn option rather than the -matchfn option), pull out all the words from the output matchseg files *including all noises and silences* and then run a force-alignment on the corresponding pronunciation transcript to get the phone segmentation. You will have to remove the <s>, <sil> and </s> markers before you force-align though, since the aligner introduces them perforce.

Q. How do I obtain word segmentations corresponding to my transcripts?

A. You can use the SPHINX decoder to obtain phone or word level segmentations. Replacing the flag -matchfn with -matchsegfn in your decode options will generate the hypotheses along with word segmentations in the matchfile. You can run a phone level decode in a similar way to obtain phone segmentations.

Q. The recordings in my training corpus are very long (about 30 minutes each or more). Is there an easy way to break them up into smaller utterances?

A. One easy way to segment is to build a language model from the transcripts of the utterances you are trying to segment, and decode over 50 sec. sliding windows to obtain the word boundaries. Following this, the utterances can be segmented (say) at approx. 30 sec. slots. Silence or breath markers are good breaking points.
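
Once the word boundaries are known, choosing the break points can be automated. The sketch below is ours: it assumes the decoder output has already been parsed into (word, start, end) triples in seconds, and it breaks at the first silence or breath marker that comes at least 30 seconds after the previous break. The marker names are examples only.

```python
def choose_breaks(segments, target_len=30.0, silence_words=("<sil>", "++BREATH++")):
    """Pick break times at silence/breath markers, roughly target_len seconds apart."""
    breaks = []
    last_break = 0.0
    for word, start, end in segments:
        if word in silence_words and start - last_break >= target_len:
            midpoint = (start + end) / 2.0  # cut in the middle of the pause
            breaks.append(midpoint)
            last_break = midpoint
    return breaks


segments = [
    ("HELLO", 0.0, 10.0), ("<sil>", 10.0, 11.0),
    ("WORLD", 11.0, 31.0), ("<sil>", 31.0, 32.0),
    ("AGAIN", 32.0, 50.0), ("++BREATH++", 50.0, 51.0),
    ("DONE", 51.0, 70.0), ("<sil>", 70.0, 72.0),
]
print(choose_breaks(segments))  # [31.5, 71.0]
```

The transcript has to be cut at the same words so that each new utterance keeps its own text.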

There are other, better ways to segment, but they are meant to do a good job in situations where you do not have the transcripts for your recordings (e.g. for speech that you are about to decode). They will certainly be applicable in situations where you do have transcripts, but aligning your transcripts to the segments would involve some extra work.


FORCE-ALIGNMENT (VITERBI ALIGNMENT)

Q. Will the forced-aligner care if I leave the (correct) alternate pronunciation markers in the transcript? Or do I need to remove them?

A. The force-aligner strips off the alternate pronunciation markers and re-chooses the correct pronunciation from the dictionary.

Q. Some utterances in my corpus just don't get force-aligned. The aligner dies on them and produces no output. What's wrong?

A. Firstly, let's note that "force-alignment" is CMU-specific jargon. The force-aligner usually dies on some 1% of the files. If the models are good, it dies in fewer cases. Force-alignment fails for various reasons - you may have spurious phones in your dictionary or may not have any dictionary entry for one or more words in the transcript, the models you are using may have been trained on acoustic conditions which do not match the conditions in the corpus you are trying to align, you may have trained initial models with transcripts which are not force-aligned (this is a standard practice) and for some reason one or more of the models may have zero parameter values, you may have bad transcriptions or may be giving the wrong transcript for your feature files, there may be too much noise in the current corpus, etc. The aligner does not check whether your list of feature files and the transcript file entries are in the same order. Make sure that you have them in order, where there is a one-to-one correspondence between the two files. If these files are not aligned, the aligner will not align most utterances. The ones that do get aligned will be out of sheer luck and the alignments will be wrong.
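
One of the causes listed above, words with no dictionary entry, is easy to check before running the aligner. The sketch below is ours and assumes the transcript lines contain only words (no utterance tags or pronunciation markers).

```python
def words_missing_from_dictionary(transcript_lines, dictionary_words):
    """Return the transcript words that have no entry in the dictionary."""
    missing = set()
    for line in transcript_lines:
        for word in line.split():
            if word not in dictionary_words:
                missing.add(word)
    return sorted(missing)


dictionary_words = {"I", "SAW", "A", "PANDA"}
print(words_missing_from_dictionary(["I SAW A PANDA BEAR"], dictionary_words))
# ['BEAR']
```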

There may be another reason for alignment failure: if you are force-aligning using a phoneset which is a subset of the phones for which you have context-dependent models (such that the dictionary which was used to train your models has been mapped on to a dictionary with fewer phones), then for certain acoustic realizations of your phones, the context-dependent models may not be present. This causes the aligner to back off to context-independent (CI) models, giving poor likelihoods. When the likelihoods are too poor, the alignment fails. Here's a possible complication: sometimes in this situation, the backoff to CI models does not work well (for various reasons which we will not discuss here). If you find that too many of your utterances are not getting force-aligned and suspect that this may be due to the fact that you are using a subset of the phone-set in the models used for alignment, then an easy solution is to temporarily restore the full phoneset in your dictionary for force-alignment, and once it is done, revert to the smaller set for training, without changing the order of the dictionary entries.

After Viterbi-alignment, if you are still left with enough transcripts to train, then it is a good idea to go ahead and train your new models. The new models can be used to redo the force-alignment, and this would result in many more utterances getting successfully aligned. You can, of course, iterate the process of training and force-alignment if getting most of the utterances to train is important to you. Note that force-alignment is not necessary if a recognizer uses phone-networks for training. However, having an explicit aligner has many uses and offers a lot of flexibility in many situations.

Q. I have a script for force-alignment with continuous models. I want to force-align with some semi-continuous models that I have. What needs to change in my script?

A. In the script for force-alignment, apart from the paths and model file names, the model type has to be changed from ".cont" to ".semi" and the feature type has to be changed to "s2_4x", if you have 4-stream semi-continuous models.

Q. I'm using the sphinx-2 force-aligner to do some aligning. It basically works, but seems way too happy about inserting a SIL phone between words (when there clearly isn't any silence). I've tried to compensate for this by playing with the -silpen flag but it didn't help. Why does the aligner insert so many spurious silences?

A. The problem may be due to many factors. Here's a checklist that might help you track down the problem:

  1. Is there an agc mismatch between your models and your force-aligner settings? If you have trained your models with agc "max" then you must not set agc to "none" during force-alignment (and vice-versa).
  2. Listen to the words which are wrongly followed by the SIL phone after force-alignment. If such a word clearly does not have any silence following it in the utterance, then check the pronunciation of the word in your dictionary. If the pronunciation is not really correct (for example, if you have a flapped "R" in place of a retroflexed "R", or a "Z" in place of an "S", which is quite likely to happen if the accent is non-native), the aligner is likely to make an error and insert a silence or noise word in the vicinity of that word.
  3. Are your features being computed exactly the same way as the features that were used to train the acoustic models that you are using to force-align? Your parametrization can go wrong even if you are using the *same* executable to compute features now as you used for training the models. If, for example, your training features were computed at the standard analysis rate of 100 frames/sec with 16 kHz, 16-bit sampling, and if you are now assuming either an 8 kHz sampling rate or 8-bit data in your code, you'll get twice as many frames as you should for any given utterance. With features computed at this rate, the force-aligner will just get silence-happy.
  4. Are the acoustic conditions and *speech* bandwidth of the data you are force-aligning the same as those for which you have acoustic models? For example, if you are trying to force-align the data recorded directly off your TV with models built with telephone data, then even if your sampling rate is the same in both cases, the alignment will not be good.
  5. Are your beams too narrow? Beams should typically be of the order of 1e-40 to 1e-80. You might mistakenly have them set at a much higher value (which means much *narrower* beams).
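Item 3 in the checklist above can be verified with a little arithmetic. The following Python sketch estimates how many frames a headerless, single-channel file yields under a given assumption about sampling rate and sample width. The function name and its parameters are hypothetical and are not part of any SPHINX tool; the calculation ignores the analysis window length and any zero-padding at the end of the utterance.

```python
def estimated_frames(n_bytes, sample_rate_hz, bytes_per_sample, frames_per_sec=100):
    """Approximate frame count for a headerless, single-channel audio file."""
    n_samples = n_bytes // bytes_per_sample
    samples_per_frame_shift = sample_rate_hz // frames_per_sec
    return n_samples // samples_per_frame_shift


# A 10-second utterance recorded at 16 kHz with 16-bit samples occupies 320000 bytes.
n_bytes = 320000

correct = estimated_frames(n_bytes, 16000, 2)      # what the training features assumed
wrong_rate = estimated_frames(n_bytes, 8000, 2)    # wrongly assuming 8 kHz sampling
wrong_width = estimated_frames(n_bytes, 16000, 1)  # wrongly assuming 8-bit samples

print(correct, wrong_rate, wrong_width)
```

Either wrong assumption doubles the frame count, which is the mismatch described in item 3.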


    BAUM-WELCH ITERATIONS AND ASSOCIATED LIKELIHOODS

    Q. How many iterations of Baum-Welch should I run for CI/CD-untied/CD-tied training?

    A. 6-10 iterations are usually good enough for each. However, it is better to check the ratio of total likelihoods from the previous iteration to the current one to decide if a desired convergence ratio has been achieved. The scripts provided with the SPHINX package keep track of these ratios to automatically decide how many iterations to run, based on a "desired" convergence ratio that you must provide. If you run too many iterations, the models get overfitted to the training data. You must decide if you want this to happen or not.
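A convergence check of this kind can be sketched as follows. This is only an illustration: the function names are hypothetical, and the relative-improvement formula used here is an assumption, so the exact ratio computed by the SPHINX training scripts may be defined differently. The inputs are taken to be total log-likelihoods (non-zero, and usually negative).

```python
def relative_improvement(prev_loglik, curr_loglik):
    """Change in total log-likelihood, relative to the previous iteration."""
    return (curr_loglik - prev_loglik) / abs(prev_loglik)


def has_converged(prev_loglik, curr_loglik, desired_ratio):
    """True when the improvement has dropped below the desired convergence ratio."""
    return relative_improvement(prev_loglik, curr_loglik) < desired_ratio


# Early iteration: a large improvement, so keep iterating.
early = has_converged(-1000.0, -900.0, desired_ratio=0.04)

# Later iteration: a tiny improvement, so stop.
late = has_converged(-900.0, -899.0, desired_ratio=0.04)

print(early, late)
```

Note that a negative improvement (a drop in likelihood) also counts as "converged" in this sketch; a real script might want to flag that case separately.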

    Q. The training data likelihoods at the end of my current iteration of Baum-Welch training are identical to the likelihoods at the end of the previous iteration. What's wrong and why are they not changing?

    A. The most likely reason is that the acoustic models did not get updated on your disk at the end of the previous iteration. When you begin with the same acoustic models again and again, the likelihoods end up being the same every time.

    Q. The total likelihood at the end of my current Baum-Welch iteration is actually lower than the likelihood at the end of the previous iteration. Should this happen?

    A. Theoretically, the likelihoods must increase monotonically. However, this condition holds only when the training data size is constant. In every iteration (especially if your data comes from difficult acoustic conditions), the Baum-Welch algorithm may fail in the backward pass on some random subset of the utterances. Since the effective training data size is no longer constant, the likelihoods may actually decrease at the end of the current iteration, compared to the previous likelihoods. However, this should not happen very often. If it does, then you might have to check out your transcripts and if they are fine, you might have to change your training strategy in some appropriate manner.

    Q. In my training, as the forward-backward (Baum-Welch) iterations progress, there are more and more error messages in the log file saying that the backward pass failed on the given utterance. This should not happen since the algorithm guarantees that the models get better with every iteration. What's wrong?

    A. As the models get better, the "bad" utterances are better identified through their very low likelihoods, and the backward pass fails on them. The data may be bad due to many reasons, the most common one being noise. The solution is to train coarser models, or train fewer triphones by setting the "maxdesired" flag to a lower number (of triphones) when making the untied mdef file, which lists the triphones you want to train. If this is happening during CI training, check your transcripts to see if the within-utterance silences and non-speech sounds are transcribed in appropriate places, and if your transcriptions are correct. Also check if your data has difficult acoustic conditions, as in noisy recordings with non-stationary noise. If all is well and the data is very noisy and you can't do anything about it, then reduce the number of states in your HMMs to 3 and train models with a noskip topology. If the utterances still die, you'll just have to live with it. Note that as more and more utterances die, more and more states in your mdef file are "not seen" during training. The log files will therefore have more and more messages to this effect.

    Q. My baum-welch training is really slow! Is there something I can do to speed it up, apart from getting a faster processor?

    A. In the first iteration, the models begin from flat distributions, and so the first iteration is usually very slow. As the models get better in subsequent iterations, the training speeds up. There are other reasons why the iterations could be slow: the transcripts may not be force-aligned or the data may be noisy. For the same amount of training data, clean speech training gets done much faster than noisy speech training. The noisier the speech, the slower the training. If you have not force-aligned, the solution is to train CI models, force-align and retrain. If the data are noisy, try reducing the number of HMM states and/or not allowing skipped states in the HMM topology. Force-alignment also filters out bad transcripts and very noisy utterances.

    Q. The first iteration of Baum-Welch through my data has an error:

     INFO: ../main.c(757): Normalizing var
     ERROR: "../gauden.c", line 1389: var (mgau=0, feat=2, density=176, component=1) < 0
    
    Is this critical?

    A. This happens because we use the following formula to estimate variances:

    variance = avg(x^2) - [avg(x)]^2

    There are a few weighting terms included (the Baum-Welch "gamma" weights), but they are immaterial to this discussion. The *correct* way to estimate variances is

    variance = avg([x - avg(x)]^2)

    The two formulae are equivalent, of course, but the first one is far more sensitive to arithmetic precision errors in the computer and can result in negative variances. The second formula is too expensive to compute (we need one pass through the data to compute avg(x), and another to compute the variance). So we use the first one in SPHINX, and we therefore sometimes get errors of the kind shown above.

    The error is not critical (things will continue to work), but may be indicative of other problems, such as bad initialization, or isolated clumps of data with almost identical values (i.e. bad data).
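The difference between the two variance formulae can be illustrated with a short Python sketch. This is not the trainer's code: the function names are hypothetical, the Baum-Welch weights are left out, and Python floats are double precision, which may differ from the precision used inside the trainer.

```python
def one_pass_variance(xs):
    """avg(x^2) - [avg(x)]^2: both averages can be accumulated in a single pass."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean


def two_pass_variance(xs):
    """avg([x - avg(x)]^2): needs the mean first, then a second pass over the data."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


# On well-behaved data the two formulae agree.
well_behaved = [1.0, 2.0, 3.0, 4.0]
print(one_pass_variance(well_behaved), two_pass_variance(well_behaved))

# On a clump of nearly identical large values, the one-pass formula subtracts two
# huge, almost equal numbers. Its result is dominated by rounding error and may
# even be negative, while the two-pass result can never be negative.
clumped = [1.0e8 + 1.0e-4 * i for i in range(10)]
print(one_pass_variance(clumped), two_pass_variance(clumped))
```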

    Another thing that usually points to bad initialization is that you may have mixture-weight counts that are exactly zero (in the case of semi-continuous models) or the gaussians may have zero means and variances (in the case of continuous models) after the first iteration.

    If you are computing semi-continuous models, check to make sure the initial means and variances are OK. Also check to see if all the cepstra files are being read properly.


    DICTIONARIES, PRONUNCIATIONS AND PHONE-SETS

    Q. I've been using a script from someone that removes the stress markers in cmudict as well as removes the deleted stops. This script is removing the (2) or (3) markers that occur after multiple pronunciations of the same word. That is,

    A    EY
    A    AX

    is produced instead of

    A    EY
    A(2) AX

    What is the consequence of removing this multiple pronunciation marker? Will things still work?

    A. The (2), (3) etc. are important for the training. It is the only way the trainer knows which pronunciation of the word has been used in the utterance, and that is what the force-aligner decides for the rest of the training. So, once the force-alignment is done, the rest of the training has to go through with the same dictionary, and neither the pronunciations nor the pronunciation markers should change.

    Independently of this, the script that you are using should be renumbering the dictionary pronunciations in the manner required by the trainer in order for you to use it for training and decoding. Pronunciation markers are required both during training and during decoding.
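A renumbering step of this kind might look like the following Python sketch. The function name is hypothetical, and the sketch assumes one entry per line with the word in the first whitespace-separated column. It does not preserve column alignment, and it should be run before force-alignment, never between force-alignment and the rest of the training.

```python
import re
from collections import defaultdict

# Matches an existing alternate-pronunciation marker such as "(2)" at the end of a word.
MARKER = re.compile(r"\(\d+\)$")


def renumber_pronunciations(lines):
    """Give the 2nd, 3rd, ... pronunciation of each word a (2), (3), ... marker."""
    seen = defaultdict(int)
    out = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        base = MARKER.sub("", parts[0])
        seen[base] += 1
        word = base if seen[base] == 1 else f"{base}({seen[base]})"
        out.append(word + " " + " ".join(parts[1:]))
    return out


fixed = renumber_pronunciations(["A  EY", "A  AX"])
print("\n".join(fixed))
```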

    Q. I have trained a set of models, and one of the phones I have trained models for is "TS" (as in CATS = K AE TS). Now I want to remove the phone TS from the dictionary and do not want to retain its models. What are the issues involved?

    A. You can change every instance of the phone "TS" in your decode dictionary to "T S". In that case, you need not explicitly remove the models for TS from your model set. Those models will not be considered during decoding. However, if you just remove TS from the decode dictionary and use the models that you have, many of the new triphones involving T and S would not have corresponding models (since they were not there during training). This will adversely affect recognition performance. You can compose models for these new triphones from the existing set of models by making a new tied-mdef file with the new decode dictionary that you want to use. This is still not as good as training explicitly for those triphones, but is better than not having the triphones at all. The ideal thing to do would be to train models without "TS" in the training dictionary as well, because replacing TS with T S will create new triphones. Data will get redistributed and this will affect the decision trees for all phones, especially T and S. When decision trees get affected, state tying gets affected, and so the models for all phones turn out to be slightly different.
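The dictionary rewrite itself is mechanical. Here is a minimal Python sketch, assuming one entry per line with the word in the first column; the function name is hypothetical. It only touches the pronunciation columns, so a word that happens to be spelled "TS" is left alone.

```python
def split_phone(dict_lines, old_phone="TS", new_phones=("T", "S")):
    """Replace every occurrence of old_phone in the pronunciations with new_phones."""
    out = []
    for line in dict_lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        word, phones = parts[0], parts[1:]
        expanded = []
        for phone in phones:
            if phone == old_phone:
                expanded.extend(new_phones)
            else:
                expanded.append(phone)
        out.append(word + " " + " ".join(expanded))
    return out


print(split_phone(["CATS K AE TS"]))
```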

    Q. What is a filler dictionary? What is its format?

    A. A filler dictionary is like any dictionary, with a word and its pronunciation listed on a line. The only difference is that the word is what *you* choose to call a non-speech event, and its pronunciation is given using whatever filler phones you have models for (or are building models for). So if you have models for the phone +BREATH+, then you can compose a filler dictionary to look like

    ++BREATHING++ +BREATH+

    or

    BREATH_SOUND +BREATH+

    or...

    The left hand entry can be anything (we usually just write the phone with two plus signs on either side - but that's only a convention).

    Here's an example of what a typical filler dictionary looks like:

    ++BREATH++                     +BREATH+
    ++CLICKS++                     +CLICKS+
    ++COUGH++                      +COUGH+
    ++LAUGH++                      +LAUGH+
    ++SMACK++                      +SMACK+
    ++UH++                         +UH+
    ++UHUH++                       +UHUH+
    ++UM++                         +UM+
    ++FEED++                       +FEED+
    ++NOISE++                      +NOISE+
    
    When using this with SPHINX-III, just make sure that there are no extra spaces after the second column word, and no extra empty lines at the end of the dictionary.
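Both problems are easy to clean up automatically. The following Python sketch (the function name is hypothetical) strips trailing whitespace from every line and drops empty lines, leaving a single newline to terminate the last entry.

```python
def clean_dictionary_text(text):
    """Remove trailing spaces on each line and drop all empty lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    # One newline ends the last entry; no empty lines follow it.
    return "\n".join(lines) + "\n"


messy = "++UM++   +UM+  \n\n++UH++ +UH+\n\n\n"
print(repr(clean_dictionary_text(messy)))
```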


    DECISION-TREE BUILDING AND PARAMETER SHARING

    Q. In HTK, after we do decision-tree-driven state-clustering, we run a "model compression" step, whereby any triphones which now (after clustering) point to the same sequence of states are mapped, so that they are effectively the same physical model. This would seem to have the benefit of reducing the recognition lattice size (although we've never verified that HVite actually does this.) Do you know if Sphinx 3.2 also has this feature?

    A. SPHINX does not need to do any compression because it does not physically duplicate any distributions. All state-tying is done through a mapping table (the mdef file), which points each state to the appropriate distributions.

    Q. The log file for bldtree gives the following error:

    INFO: ../main.c(261): 207 of 207 models have observation count greater than 0.000010
    FATAL_ERROR: "../main.c", line 276: Fewer state weights than states
    

    A. The -stwt flag has fewer arguments than the number of HMM states that you are modeling in the current training. The -stwt flag needs a string of numbers, one for each HMM state. For example, if you were using 5-state HMMs, then the flag could be given as "-stwt 1.0 0.3 0.1 0.01 0.001". These numbers specify the weights to be given to state distributions during tree building. The first number is the weight given to the *current* state. The second number specifies the weight to be given to the states *immediately adjacent* to the current state (if there are any), the third number specifies the weight to be given to adjacent states *one removed* from the immediately adjacent one (if there are any), and so on.
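Read this way, the weight applied to a state depends only on its distance from the current state. The Python sketch below shows that reading, together with the count check that produces the error above. Both function names are hypothetical, and this is an interpretation of the description given here, not code taken from bldtree.

```python
def check_stwt(stwt, n_states):
    """Mimic the sanity check: there must be one weight per HMM state."""
    if len(stwt) < n_states:
        raise ValueError("Fewer state weights than states")


def weights_for_state(stwt, current_state):
    """Weight given to each state's distribution when building the tree for current_state."""
    n_states = len(stwt)
    return [stwt[abs(state - current_state)] for state in range(n_states)]


stwt = [1.0, 0.3, 0.1, 0.01, 0.001]  # as in "-stwt 1.0 0.3 0.1 0.01 0.001"
check_stwt(stwt, n_states=5)

print(weights_for_state(stwt, 0))  # tree for the first state
print(weights_for_state(stwt, 2))  # tree for the middle state
```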


    FEATURE COMPUTATION

    Q. How appropriate are the standard frame specifications for feature computation? I am using the default values but the features look a bit "shifted" with respect to the speech waveform. Is this a bug?

    A. There are two factors here: the frame *size* and the frame *rate*. Analysis frame size is typically 25 ms. Frame rate is 100 frames/sec. In other words, we get one frame every 10 ms (a nice round number), but we may need to adjust boundaries a little bit because of the frame size (a 5ms event can get smeared over three frames - it could occur in the tail end of one frame, the middle of the next one, and the beginning of the third, for the 10ms frame shifts). The feature vectors sometimes look shifted with respect to the speech samples. However, there is no shift between the frames and the speech data. Any apparent shift is due to smearing. We do frequently get an additional frame at the end of the utterance because we pad zeros, if necessary, after the final samples in order to fill up the final frame.
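The smearing can be made concrete with a small Python sketch that lists which analysis frames overlap a short event. The function name is hypothetical, and frame k is assumed to cover the interval from k*10 ms to k*10 + 25 ms.

```python
def frames_touched(event_start_ms, event_end_ms, frame_size_ms=25, frame_shift_ms=10):
    """Indices of all analysis frames that overlap the given event."""
    touched = []
    k = 0
    # Frame k starts at k * frame_shift_ms; stop once frames start after the event ends.
    while k * frame_shift_ms < event_end_ms:
        frame_end = k * frame_shift_ms + frame_size_ms
        if frame_end > event_start_ms:
            touched.append(k)
        k += 1
    return touched


# A 5 ms event from 22 ms to 27 ms falls in the tail end of frame 0,
# the middle of frame 1 and the beginning of frame 2.
print(frames_touched(22, 27))
```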

    Q. How do I find the center frequencies of the Mel filters?

    A. The mel function we use to find the mel frequency for any frequency x is

    (2595.0*(float32)log10(1.0+x/700.0))

    Substitute x with the upper and lower frequencies, subtract the results, and divide by (the number of filters you have + 1). The bandwidth of each filter is twice the number you get after this division. That number plus the lower frequency (on the mel scale) is the center frequency of the first filter. The rest of the center frequencies can be found by using the bandwidths and the knowledge that the filters are equally spaced on the mel frequency axis and overlap by half the bandwidth. These center frequencies can be transformed back to normal frequency using the inverse mel function

    (700.0*((float32)pow(10.0,x/2595.0) - 1.0))

    where x is now the center frequency.
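The procedure above can be written out as a short Python sketch. The two mel functions are transcribed from the expressions given above; the function names and the example frequency range and filter count are only illustrative.

```python
import math


def mel(freq_hz):
    """Mel value for a frequency in Hz."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_inverse(mel_value):
    """Frequency in Hz for a mel value."""
    return 700.0 * (10.0 ** (mel_value / 2595.0) - 1.0)


def filter_center_frequencies(lower_hz, upper_hz, n_filters):
    """Center frequencies (Hz) of filters equally spaced on the mel axis."""
    mel_low = mel(lower_hz)
    mel_high = mel(upper_hz)
    # Spacing between adjacent centers; each filter's bandwidth is twice this.
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_inverse(mel_low + step * (i + 1)) for i in range(n_filters)]


centers = filter_center_frequencies(lower_hz=100.0, upper_hz=4000.0, n_filters=10)
for c in centers:
    print(round(c, 1))
```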

    Q. Does the front-end executable compute difference features?

    A. No. The difference features are computed during runtime by the SPHINX-III trainer and decoders.

    Q. What would be the consequence of widening the analysis windows beyond 25ms for feature computation?

    A. Analysis windows are currently 25 ms wide with 10ms shifts. Widening them would have many undesirable consequences:

    Q. I ran the new wave2feat program with the configuration of srate = 16000 and nfft = 256, then the program crashed. I changed the nfft to 512, it works. So I'd like to know why.

    A. At a sampling rate of 16000 samples/sec, a 25ms frame has 400 samples. If you try to fill these into 256 locations of allocated memory (in the FFT) you will have a segmentation fault. There *could* have been a check for this in the FFT code, but the default for 16kHz has been set correctly to be 512, so this was considered unnecessary.
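The arithmetic behind this can be sketched in Python. The function names are hypothetical, and the sketch assumes that the FFT size must be a power of two that is at least as large as the number of samples in one frame.

```python
def samples_per_frame(sample_rate_hz, frame_size_ms=25):
    """Number of samples in one analysis frame."""
    return sample_rate_hz * frame_size_ms // 1000


def smallest_fft_size(sample_rate_hz, frame_size_ms=25):
    """Smallest power of two that can hold one analysis frame."""
    needed = samples_per_frame(sample_rate_hz, frame_size_ms)
    size = 1
    while size < needed:
        size *= 2
    return size


print(samples_per_frame(16000), smallest_fft_size(16000))
print(samples_per_frame(8000), smallest_fft_size(8000))
```

At 16000 samples/sec a frame has 400 samples, which does not fit in 256 points but does fit in 512.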


    MODELING FILLED PAUSES AND NON-SPEECH EVENTS

    Q. Can you explain the difference between putting the words as fillers ++()++ instead of just putting them in the normal dictionary? My dictionary currently contains pronunciations for UH-HUH, UH-HUH(2) and UH-HUH(3). Should all of these effectively be merged to ++UH-HUH++ and mapped to a single filler phone like +UH-HUH+?

    A. Putting them as normal words in the dictionary should not matter if you are training CI models. However, at the CD stage when the list of training triphones is constructed, the phones corresponding to the (++ ++) entries are mapped by the trainer to silence. For example the triphone constructed from the utterance

    ++UM++ A ++AH++

    would be AX(SIL,SIL) and not AX(+UM+,AA) [if you have mapped ++UM++ to +UM+ and ++AH++ to the phone AA for training, in one of the training dictionaries]

    Also, when you put ++()++ in the main dictionary and map it to some sequence of phones other than a single +()+ phone, you cannot build a model for the filler. For example UH-HUH may be mapped to AH HH AX , AX HH AX etc in the main dict, and when you train, the instances of UH-HUH just contribute to the models for AH, AX or HH and the corresponding triphones. On the other hand, if you map ++UH-HUH++ to +UH-HUH+, you can have the instances contribute exclusively to the phone +UH-HUH+. The decision to keep the filler as a normal word in the training dictionary and assign alternate pronunciations to it OR to model it exclusively by a filler phone must be judiciously made keeping the requirements of your task in mind.

    During decoding and in the language model, the filler words ++()++ are treated very differently from the other words. The scores associated are computed in a different manner, taking certain additional insertion penalties into account.

    Also, the SPHINX-II decoder is incapable of using a new filler unless there is an exclusive model for it (this is not the case with the SPHINX-III decoder). If there isn't, it will treat the filler as a normal dictionary word and will ignore it completely if it is not there in the language model (which usually doesn't have fillers), causing a significant loss in accuracy for some tasks.

    Q. My training data contains no filler words (lipsmack, cough etc.) Do you think I should retrain trying to insert fillers during forced alignment so that I could train on them? Since what I have is spontaneous speech, I can't imagine that in all 20000 utterances there are no filled pauses etc.

    A. Don't use falign to insert those fillers. The forced aligner has a tendency to arbitrarily introduce fillers all over the place. My guess is that you will lose about 5%-10% relative by not having the fillers to model. If you are going to use the SPHINX-III decoder, however, you can compose some important fillers like "UH" and "UM" as "AX" or "AX HH" or "AX M" and use them in the fillerdict. However, the sphinx-2 decoder cannot handle this. If possible, try listening to some utterances and see if you can insert about 50 samples of each filler - that should be enough to train them crudely.

    Q. How is SIL different from the other fillers? Is there any special reason why I should designate the filler phones as +()+? What if I *want* to make filler triphones?

    A. Silence is special in that it forms contexts for triphones, but doesn't have its own triphones (i.e. triphones for which it is the central phone). The fillers neither form contexts nor occur as independent triphones. If you want to build triphones for a filler, then the filler must be designated as a proper phone without the "+" in the dictionaries.

    Q. What is the meaning of the two columns in the fillerdict? I want to reduce the number of fillers in my training.

    A. In a filler dictionary, we map all non-speech-like sounds to some phones and we then train models for those phones. For example, we may say

    ++GUNSHOT++     +GUNSHOT+
    
    The meaning is the same as "the pronunciation of the word ++GUNSHOT++ in the transcripts must be interpreted to be +GUNSHOT+". Now suppose I have five more filler words in my transcripts:
    ++FALLINGWATER++
    ++LAUGH++
    ++BANG++
    ++BOMBING++
    ++RIFLESHOT++
    
    Then I know that the sounds of ++BANG++, ++BOMBING++ and ++RIFLESHOT++ are somewhat similar, so I can reduce the number of filler phones to be modelled by modifying the entries in the filler dict to look like
    ++GUNSHOT++     +GUNSHOT+
    ++BANG++        +GUNSHOT+
    ++BOMBING++     +GUNSHOT+
    ++RIFLESHOT++   +GUNSHOT+
    ++FALLINGWATER++        +WATERSOUND+
    ++LAUGH++       +LAUGHSOUND+
    
    so we have to build models only for the phones +GUNSHOT+, +WATERSOUND+ and +LAUGHSOUND+ now.
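The effect of such a mapping can be checked with a few lines of Python. The helper names here are hypothetical; the sketch simply parses the two columns and collects the distinct filler phones that need models.

```python
filler_dict_text = """\
++GUNSHOT++     +GUNSHOT+
++BANG++        +GUNSHOT+
++BOMBING++     +GUNSHOT+
++RIFLESHOT++   +GUNSHOT+
++FALLINGWATER++        +WATERSOUND+
++LAUGH++       +LAUGHSOUND+
"""


def parse_filler_dict(text):
    """Map each filler word (first column) to its filler phone (second column)."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def phones_to_train(mapping):
    """Distinct filler phones for which models have to be built."""
    return sorted(set(mapping.values()))


mapping = parse_filler_dict(filler_dict_text)
print(len(mapping), "filler words ->", phones_to_train(mapping))
```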


    WHY IS MY RECOGNITION ACCURACY POOR?

    Q. I am using acoustic models that were provided with the SPHINX package on opensource. The models seem to be really bad. Why is my recognition accuracy so poor?

    A. The reason why you are getting poor recognition with the current models is that they are not trained with data from your recording setup. While they have been trained with a large amount of data, the acoustic conditions specific to your recording setup may not have been encountered during training and so the models may not be generalizable to your recordings. More than noise, training under matched conditions makes a huge difference to the recognition performance. There may be other factors, such as feature set or agc mismatch. Check to see if you are indeed using all the models provided for decoding. For noisy data, it is important to enter all the relevant noise models (filler models) provided in the noise dictionary that is being used during decoding.

    To improve the performance, the models must be adapted to the kind of data you are trying to recognize. If it is possible, collect about 30 minutes (or more if you can) of data from your setup, transcribe them carefully, and adapt the existing models using this data. This will definitely improve the recognition performance on your task.

    It may also be that your task has a small, closed vocabulary. In that case having a large number of words in the decode dictionary and language model may actually cause acoustic confusions which are entirely avoidable. All you have to do in this situation is to retain *only* the words in your vocabulary in the decode dictionary. If you can build a language model with text that is representative of the kind of language you are likely to encounter in your task, it will improve the performance considerably.
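Trimming the decode dictionary to a closed vocabulary can be done with a short filter. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes one entry per line with alternate pronunciations marked as WORD(2), WORD(3) and so on.

```python
import re

# Matches an alternate-pronunciation marker such as "(2)" at the end of a word.
MARKER = re.compile(r"\(\d+\)$")


def restrict_dictionary(dict_lines, vocabulary):
    """Keep only entries whose base word (marker removed) is in the vocabulary."""
    vocab = set(vocabulary)
    kept = []
    for line in dict_lines:
        parts = line.split()
        if parts and MARKER.sub("", parts[0]) in vocab:
            kept.append(line)
    return kept


full_dict = ["A EY", "A(2) AX", "NO N OW", "YES Y EH S"]
print(restrict_dictionary(full_dict, ["YES", "NO"]))
```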

    It may also be that you have accented speech for which correct pronunciations are not present in the decode dictionary. Check to see if that is the case, and if it is, then it would help to revise the dictionary pronunciations, add newer variants to existing pronunciations etc. Also check to see if you have all the words that you are trying to recognize in your recognition dictionary.

    If you suspect that noise is a huge problem, then try using some noise compensation algorithm on your data prior to decoding. Spectral subtraction is a popular noise compensation method, but it does not always work.

    All this, of course, assumes that the signals you are recording or trying to recognize are not distorted or clipped due to hardware problems in your setup. Check especially the utterances which are really badly recognized by actually looking at a display of the speech signals. In fact, this is the first thing that you must check.


    INTERPRETING SPHINX-II FILE FORMATS
    You can read more about the SPHINX-II file formats here

    Q. (this question is reproduced as it was asked!) I was trying to read the SphinxII HMM files (Not in a very good format). I read your provided "SCHMM_format" file with your distribution. But, life is never that easy, the cHMM files format should have been very easy and straight forward..!!!
    From your file ...
    chmm FILES
    There is one *.chmm file per ci phone. Each stores the transition matrix associated with that particular ci phone in following binary format. (Note all triphones associated with a ci phone share its transition matrix)

    (all numbers are 4 byte integers):
    
    -10     (a  header to indicate this is a tmat file)
    256     (no of codewords)
    5       (no of emitting states)
    6       (total no. of states, including non-emitting state)
    1       (no. of initial states. In fbs8 a state sequence can only begin
             with state[0]. So there is only 1 possible initial state)
    0       (list of initial states. Here there is only one, namely state 0)
    1       (no. of terminal states. There is only one non-emitting terminal
             state)
    5       (id of terminal state. This is 5 for a 5 state HMM)
    14      (total no. of non-zero transitions allowed by topology)
    [0 0 (int)log(tmat[0][0]) 0]   (source, dest, transition prob, source id)
    [0 1 (int)log(tmat[0][1]) 0]
    [1 1 (int)log(tmat[1][1]) 1]
    [1 2 (int)log(tmat[1][2]) 1]
    [2 2 (int)log(tmat[2][2]) 2]
    [2 3 (int)log(tmat[2][3]) 2]
    [3 3 (int)log(tmat[3][3]) 3]
    [3 4 (int)log(tmat[3][4]) 3]
    [4 4 (int)log(tmat[4][4]) 4]
    [4 5 (int)log(tmat[4][5]) 4]
    [0 2 (int)log(tmat[0][2]) 0]
    [1 3 (int)log(tmat[1][3]) 1]
    [2 4 (int)log(tmat[2][4]) 2]
    [3 5 (int)log(tmat[3][5]) 3]
    
    There are thus 65 integers in all, and so each *.chmm file should be 65*4 = 260 bytes in size.
    ...
    that should have been easy enough, until I was surprised with the fact that the probabilities are all written in long (4 bytes) format although the float is also 4 bytes so no space reduction is achieved, Also they are stored LOG and not linear although overflow considerations (the reasons for taking the logs are during run time not in the files...)
    All this would be normal and could be achieved ....
    but when I opened the example files I found very strange data that would not represent any linear or logarithmic or any format of probability values That is if we took the file "AA.chmm" we would find that the probabilities from state 0 to any other state are written in hex as follows:
    (0,0) 00 01 89 C7
    (0,1) 00 01 83 BF
    (0,2) 00 01 10 AA
    
    As I recall that these probabilities should all summate to "1". Please, show me how this format would map to normal probabilities like 0.1, 0.6, 0.3 ...

    A. First - we store integers for historic reasons. This is no longer the case in the Sphinx-3 system. The sphinx-2 is eventually going to be replaced by sphinx-3, so we are not modifying that system. One of the original reasons for storing everything in integer format was that integer arithmetic is faster than floating point arithmetic in most computers. However, this was not the only reason.
    Second - we do not actually store *probabilities* in the chmm files. Instead we store *expected counts* (which have been returned by the baum-welch algorithm). These have to be normalized by summing and dividing by the sum.
    Finally - the numbers you have listed below translate to the following integers:

    >(0,0) 00 01 89 C7
    This number translates to 100807
    
    >(0,1) 00 01 83 BF
    This number translates to 99263
    
    >(0,2) 00 01 10 AA                                                             
    This number translates to 69802
    
    These numbers are the logarithmic version of the floating point counts with one simple variation - the logbase is not "e" or 10; it is 1.0001. This small base was used for reasons of precision - larger bases would result in significant loss of precision when the logarithmized number was truncated to integer.
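The whole chain, from the hex bytes to normalized transition probabilities, can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the bytes are in the order shown in the question (most significant byte first). The normalization is done relative to the largest stored value, which gives the same result as raising 1.0001 to each stored value, summing and dividing by the sum.

```python
import math

LOG_BASE = 1.0001


def stored_value(hex_bytes):
    """Interpret four hex bytes, most significant first, as an integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), byteorder="big")


def normalized_probabilities(log_counts):
    """Turn log-base-1.0001 expected counts into probabilities that sum to 1."""
    top = max(log_counts)
    scaled = [math.exp((c - top) * math.log(LOG_BASE)) for c in log_counts]
    total = sum(scaled)
    return [s / total for s in scaled]


# Transitions out of state 0 in AA.chmm, as listed in the question.
values = [stored_value(h) for h in ("00 01 89 C7", "00 01 83 BF", "00 01 10 AA")]
probs = normalized_probabilities(values)

print(values)
print(probs)
```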

    Q. The problem that I am facing is that I already have an HMM model trained & generated using the Entropic HTK and I want to try to use your decoder with this model. So I am trying to build a conversion tool to convert from the HTK format to your format. In HTK format, the transition matrix is all stored in probabilities!! So how do I convert these probabilities into your "expected counts"?

    A. You can take logbase 1.0001 of the HTK probs, truncate and store them.
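In Python, this conversion could look like the sketch below. The function name is hypothetical. Transitions with zero probability have no logarithm, so the sketch rejects them; how such transitions should be handled in a real conversion tool is left open here.

```python
import math

LOG_BASE = 1.0001


def to_log_base_1_0001(prob):
    """Log base 1.0001 of a probability, truncated to an integer."""
    if prob <= 0.0:
        raise ValueError("cannot take the logarithm of a zero probability")
    # int() truncates toward zero.
    return int(math.log(prob) / math.log(LOG_BASE))


stored = to_log_base_1_0001(0.6)
recovered = LOG_BASE ** stored

print(stored, recovered)
```

The recovered value differs from the original probability only by the truncation error, which is less than one step of the 1.0001 base.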


    HYPOTHESIS COMBINATION

    Q. In the hypothesis-combination code, all the scaling factors of recombined hypotheses are 0. Why is this so?

    A. There are two parts to the answer. First, since the hypothesis combination code does not perform rescaling of scores during combination, there is no scaling involved. Hence the scaling factor comes out to be 0. This is usually what happens with any method that rescores lattices without actually recomputing acoustic scores. Second, the code uses *scaled* scores for recombination. This is because different features have different dynamic ranges, and therefore the likelihoods obtained with different features are different. In order to be able to compare different acoustic features, their likelihoods would have to be normalized somehow. Ideally, one would find some global normalization factor for each feature, and normalize the scores for that feature using this factor. However, since we do not have global normalization factors, we simply use the local normalization factor that the decoder has determined. This has the added advantage that we do not have to rescale any likelihoods. So the true probability of a word simply remains LMscore+acoustic score. The correct way of finding scaling factors (esp. in the case of combination at the lattice level, which is more complex than combination at the hypothesis level) is a problem that, if solved properly, will give us even greater improvements with combination.

    Q. Given the scaling factor and the acoustic and LM likelihoods of a word in the two hypotheses to be combined, how do we decide which one should appear in the recombined hypothesis? For example, the word "SHOW" appears in both hypotheses but in different frames (one is in the 40th, another is in the 39th) - these words are merged - but how should we decide the beginning frame of the word "SHOW" in the recombined hypothesis, and why does it become the 40th frame after recombination?

    A. This is another problem with "merging" nodes as we do it. Every time we merge two nodes, and permit some difference in the boundaries of the words being merged, the boundaries of the merged node become unclear. The manner in which we have chosen the boundaries of the merged node is just one of many ways of doing it, none of which has any clear advantage over the others. It must be noted though that if we choose the wider of the two boundaries (e.g. if we merge WORD(10,15) with WORD(9,14) to give us WORD(9,15)), the resultant merged node gets "wider". This can be a problem when we are merging many different hypotheses, as some of the nodes can get terribly wide (when many of the hypotheses have a particular word, but with widely varying boundaries), resulting in loss of performance. This is an issue that must be cleared up for lattice combination.
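
    The widening effect can be seen in a small sketch. It illustrates only the "take the wider boundary" choice mentioned above, not necessarily what the combination code does; the function name and the extra spans are made up.

```python
def merge_union(a, b):
    """Merge two (start, end) frame spans by taking the widest boundaries."""
    return (min(a[0], b[0]), max(a[1], b[1]))

# The example from the text: WORD(10,15) merged with WORD(9,14).
print(merge_union((10, 15), (9, 14)))

# Hypothetical spans for the same word in several hypotheses.
spans = [(10, 15), (9, 14), (12, 18), (7, 13)]
merged = spans[0]
for s in spans[1:]:
    merged = merge_union(merged, s)
print(merged)  # wider than any single input span
```

    Each input span here covers at most 7 frames, while the merged node covers frames 7 to 18.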

    Q. I noticed that the acoustic likelihood of a merged word sometimes doesn't change from the original hypothesis to the recombined hypothesis. For example, the word "FOR" has an acoustic likelihood of -898344 in one hypothesis and -757404 in another, and it appears at the 185th frame in both. But in the recombined hypothesis, the word "FOR" appears at the 185th frame with likelihood -757404, the same as in one of the hypotheses. These likelihoods should have been combined, but it appears that they haven't been. Why not?

    A. The scores you see are *log* scores. So, when we combine -757404 with -898344, we actually compute log(e^-757404 + e^-898344). But e^-757404 >> e^-898344, so e^-757404 + e^-898344 = e^-757404 to within many decimal places. As a result the combined score is simply -757404.
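
    This can be checked numerically with a small log-add sketch. The function name is made up. Base e is used in the text above, while this FAQ gives 1.0001 as the default logbase of the decoder and the aligner, so both are tried here.

```python
import math

def log_add(a, b, base=math.e):
    """Return log_base(base**a + base**b), computed without underflow."""
    hi, lo = max(a, b), min(a, b)
    # base**(lo - hi) is the smaller term relative to the larger one.
    ratio = math.exp((lo - hi) * math.log(base))
    return hi + math.log1p(ratio) / math.log(base)

print(log_add(-757404, -898344))                 # base e
print(round(log_add(-757404, -898344, 1.0001)))  # base 1.0001, as an integer
```

    In both bases the smaller term changes the result by less than one integer unit, so the combined score still reads -757404.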


    LANGUAGE MODEL

    Q. How can the backoff weights of a language model be positive?

    A. Here's how we can explain positive numbers in place of backoff weights in the LM: The numbers you see in the ARPA format LM (used in CMU) are not probabilities. They are log base 10 numbers, so you have log10(probs) and log10(backoffweights). Backoff weights are NOT probabilities.

    Consider a 4 word vocab

    A B C D.

    Let their unigram probabilities be

    A     0.4
    B     0.3
    C     0.2
    D     0.1
    
    which sum to 1. (no [UNK] here).

    Consider the context A. Suppose in the LM text we only observed the strings AA and AB. To account for the unseen strings AC and AD (in this case) we will perform some discounting (using whatever method we want to). So after discounting, let us say the probabilities of the seen strings are:
    P(A|A) = 0.2
    P(B|A) = 0.3
    So, since we've never seen AC or AD, we approximate P(C|A) with
    P(C|A) = bowt(A) * P(C)
    and P(D|A) with
    P(D|A) = bowt(A) * P(D)
    So we should get
    P(A|A) + P(B|A) + P(C|A) + P(D|A) = bowt(A)*(P(C)+P(D)) + P(A|A) + P(B|A)
    = bowt(A) * (0.2+0.1) + 0.2 + 0.3
    = 0.5 + 0.3*bowt(A)

    But the sum P(A|A)..P(D|A) must be 1, so 0.5 + 0.3*bowt(A) = 1, i.e. bowt(A) = 0.5/0.3, which is about 1.67.

    So bowt(A) > 1

    And log10(bowt(A)), about 0.22, will be positive.

    bowts can thus in general be greater than or less than 1. In larger and fuller LM training data, where most n-grams are seen, they are mostly less than 1.
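
    The arithmetic of this example can be reproduced with a short sketch. The function and variable names are made up; the probabilities are the ones used above.

```python
import math

unigram = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
seen_bigram = {"A": 0.2, "B": 0.3}  # discounted P(w|A) for the observed AA, AB

def backoff_weight(seen, unigram):
    """Weight that makes seen plus backed-off probabilities sum to 1."""
    left_over = 1.0 - sum(seen.values())  # mass left for unseen words
    unseen_mass = sum(p for w, p in unigram.items() if w not in seen)
    return left_over / unseen_mass

bowt = backoff_weight(seen_bigram, unigram)
print(bowt, math.log10(bowt))
```

    The weight comes out greater than 1, so the log10 value that would be stored in an ARPA file is positive.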


    TRAINING CONTEXT-DEPENDENT MODELS WITH UNTIED STATES

    Q. During the cd-untied phase, things are failing miserably. A very long list of "..senones that never occur in the input data" is being generated. The result of this large list is that the means file ends up with a large number of zeroed vectors. What could be the reason?

    A. The number of triphones listed in the untied model definition file could be far greater than the actual number of triphones present in your training corpus. This could happen if the model-definition file is being created off the dictionary, without any effective reference to the transcripts (e.g. minimum required occurrence in the transcripts = 0), and with a large value for the default number of triphones, OR if, by mistake, you are using a pre-existing model-definition file that was created off a much larger corpus.

    ACOUSTIC LIKELIHOODS AND SCORES

    Q. Acoustic likelihoods for words as written out by the decoder and (force-)aligner are both positive and negative, while they are exclusively negative in the lattices. How is this possible?

    A. The acoustic likelihoods for each word as seen in the decoder and aligner outputs are scaled at each frame by the maximum score for that frame. The final (total) scaling factor is written out in the decoder MATCHSEG output as the number following the letter "S". "T" is the total score without the scaling factor. The real score is the sum of S and T. The real score for each word is written out in the logfile only if you ask for the backtrace (otherwise that table is not printed). In the falign output, only the real scores are written. The real scores of words are both positive and negative, and are large numbers because they use a very small logbase (1.0001 is the default value for both the decoder and the aligner).

    In the lattices, only the scaled scores are stored and the total scaling factor is not written out. This would not affect any rescoring of a lattice, but might affect (positively or negatively) the combination of lattices because the scaling factors may be different for each lattice.
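
    The relation between scaled and real scores can be shown with a toy sketch. All the numbers are made up, and the code only illustrates the bookkeeping described above (subtract the frame maximum, accumulate what was subtracted); it is not the decoder's actual code.

```python
# Hypothetical per-frame log scores along one word's path, and the best
# (maximum) score over all active paths in each of those frames.
path_scores = [310, -95, 240]
frame_best = [330, -95, 260]

# Scaling in the log domain is a subtraction, so every scaled value is <= 0.
scaled = [s - m for s, m in zip(path_scores, frame_best)]

T = sum(scaled)      # scaled total, never positive
S = sum(frame_best)  # accumulated scaling factor
real = S + T         # real score, which may be positive or negative

print(scaled, T, S, real)
```

    Here the scaled total is negative while the real score is positive, which matches the difference between lattice scores and decoder/aligner word scores described above.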

    Q. In the following example

    Decoding:
             WORD        Sf    Ef      Ascore      Lmscore
             SHOW        11    36     1981081      -665983
             LOCATIONS   37    99      -13693      -594246
             AND        100   109     -782779      -214771
             C-RATINGS  110   172     1245973      -608433
    
    falign:
              Sf    Ef     Ascore         WORD
              11    36    2006038         SHOW
              37    99     -37049         LOCATIONS
             100   109    -786216         AND
             110   172    1249480         C-RATINGS
    
    We see that the scores from decoding and falign are different even for words that begin and end at the same frames. Why is this so? I am confused about the difference in the ABSOLUTE score (the one without normalization by the maximum in each frame) from decode and falign. In the above example, the absolute score for the word "locations" (with lc "show", rc "and") beginning at frame no. 37 and ending at frame no. 99 is -13693 from the decode (I get the number from the decode log file, with backtrace on), while the score for exactly the same word, same time instants and same context is different in the falign output (-37049). Can this be due to the DAG merge?

    A. There are several reasons why falign and decoder scores can be different. One, as you mention, is artificially introduced DAG edges. However, in the forward pass there is no DAG creation, so AM scores obtained from the FWDVIT part of the decode will not have DAG creation related artefacts. Other possible reasons for differences in falign and decode scores are differences in logbase, differences in beam sizes, differences in floor values for the HMM parameters etc. Even when all these parameters are identical the scores can be different, because the decoder must consider many other hypotheses in its pruning strategy and may prune paths through the hypothesized transcript differently from the forced aligner. The two scores will only be identical if word, phone and state level segmentations are all identical in the two cases. Otherwise they can be different (although not in a big way). Unfortunately, the decoder does not output state segmentations, so you can't check on this. You could check phone segmentations to make sure they are identical in both cases. If they are not, that will explain it. If they are, the possibility of different state segmentations still exists.

    Q. The decoder outputs the following for utterance 440c0206:
    FWDXCT: 440c0206 S 31362248 T -28423295 A -23180713 L -536600 0 -2075510 0 <s> 56 -661272 -54550 TO(3) 73 -4390222 -89544 AUGUST 158 -2868798 -113158 INSIDER 197 -1833121 -1960 TRADING 240 -2867326 -74736 RECENTLY 292 -941077 -55738 ON 313 -1669590 -12018 THE(2) 347 -1454081 -63248 QUARTER 379 -4419716 -71648 JUMPED 511
    Now let's say I have the same utts, and the same models, but a different processing (e.g. some compensation is now applied), and in this expt I get:
    FWDXCT: 440c0206 S 36136210 T -25385610 A -21512567 L -394130 0 -1159384 0 <s> 56 -711513 -63540 TWO 73 -1679116 -52560 OTHER 103 -2163915 -52602 ISSUES 152 -2569731 -51616 BEGAN 197 -1952266 -22428 TRADING 240 -5408397 -74736 RECENTLY 333 -3049232 -47562 REPORTED(2) 395 -2819013 -29086 </s> 511
    Let's say I want to see if this compensation scheme increased the likelihood of the utterance. Can I just compare the acoustic scores (after the "A") directly, or do I have to take the scaling ("S") into account somehow (e.g. add it back in, assuming it's applied in the log domain)?

    A. You have to add the scaling factor to the acoustic likelihood against the A to get the total likelihood. You can then compare the scores across different runs with the same acoustic models.
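
    For the two lines quoted in the question, the comparison can be sketched as below. The parsing assumes the leading fields always appear in the order shown (S, T, A, L); the function name is made up.

```python
def total_acoustic(header):
    """Return A + S from the leading fields of a FWDXCT line."""
    tok = header.split()
    # tok[0] is "FWDXCT:", tok[1] is the utterance id, then label/value pairs.
    fields = {tok[i]: int(tok[i + 1]) for i in range(2, 10, 2)}
    return fields["A"] + fields["S"]

run1 = "FWDXCT: 440c0206 S 31362248 T -28423295 A -23180713 L -536600"
run2 = "FWDXCT: 440c0206 S 36136210 T -25385610 A -21512567 L -394130"

print(total_acoustic(run1), total_acoustic(run2))
```

    With these numbers the second run has the larger A + S, although the two runs also produced different word sequences.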

    DECODING PROBLEMS

    Q. I am trying to use the opensource SPHINX-II decoder with semicontinuous models. The decoder sometimes takes very long to decode an utterance, and occasionally just hangs. What could be the problem?

    A. Check your -agc*, -normmean and -bestpath flags. It is important to set the AGC/CMN flags to the same settings that were used to train the models. Otherwise, the decoder makes more mistakes. When this happens and the decoder tries to create a DAG for bestpath rescoring (which is enabled by setting "-bestpath TRUE"), it can get trapped while creating the DAG and spend inordinate amounts of time on it (sometimes never succeeding at all). Even if the AGC/CMN flags are correct, this can happen on bad utterances. Set -bestpath FALSE and check if the problem persists for the correct AGC/CMN settings. If it does, there might be a problem with your acoustic models.

    INTERPRETING SPHINX-III FILE FORMATS

    Q. What's up with the s3 mixw files? The values seem all over the place. To get mixw in terms of numbers that sum to one, do you have to sum up all mixw and divide by the total? Any idea why it is done this way? Is there a sphinx function to return normalized values? Not that it's hard to write, but no need reinventing the wheel... Here's an example of a 1gau mixw file, using printp to view contents:

    --------with -norm no
    mixw 5159 1 1
    mixw [0 0] 1.431252e+04
    
            1.431e+04 
    mixw [1 0] 3.975112e+04
    
            3.975e+04 
    mixw [2 0] 2.254014e+04
    
            2.254e+04 
    mixw [3 0] 2.578259e+04
    
            2.578e+04 
    mixw [4 0] 1.262872e+04
    
            1.263e+04 
    
    --------with -norm yes
    mixw 5159 1 1
    mixw [0 0] 1.431252e+04
    
            1.000e+00 
    mixw [1 0] 3.975112e+04
    
            1.000e+00 
    mixw [2 0] 2.254014e+04
    
            1.000e+00 
    mixw [3 0] 2.578259e+04
    
            1.000e+00 
    mixw [4 0] 1.262872e+04
    
            1.000e+00
    

    A. In s3, we have mixtures of gaussians for each state. Each gaussian has a different mixture weight. When there is only 1 gaussian/state the mixture weight is 1. However, instead of writing the number 1 we write a number like "1.431252e+04", which is basically the no. of times the state occurred in the corpus. This number is useful in other places during training (interpolation, adaptation, tree building etc). The number following "mixw" like "mixw 5159" below merely tells you the total number of mixture wts. (equal to the total no. of tied states for 1 gau/st models). So

    --------with -norm no
    mixw 5159 1 1
    mixw [0 0] 1.431252e+04
    
            1.431e+04 
    
    implies you haven't summed all weights and divided by total and
    --------with -norm yes
    mixw 5159 1 1
    mixw [0 0] 1.431252e+04
    
            1.0000..
    
    implies you *have* summed and divided by total (here you have only one mixw to do it on per state), and so get a mixw of 1.
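
    The normalization itself is only a sum and a divide per state, as in this minimal sketch. The function name and the 4-gaussian counts are made up; the single-gaussian count is the one printed above.

```python
def normalize_mixw(counts):
    """Turn the stored counts for one state into weights that sum to one."""
    total = sum(counts)
    return [c / total for c in counts]

# One gaussian per state: whatever the count, the weight becomes 1.
print(normalize_mixw([1.431252e+04]))

# Hypothetical state with 4 gaussians.
print(normalize_mixw([300.0, 100.0, 50.0, 50.0]))
```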


    WHY SPHINX III'S PERFORMANCE IS POORER THAN RECOGNIZER X?

    Q. Sphinx III's default acoustic and language models do not seem able to handle tasks like dictation. Why?

    (By Arthur Chan at 20040910) The design of a speech recognizer is largely shaped by the goal of the recognizer. In the case of CMU Sphinx, most of the effort was driven by DARPA research in the 90s. The broadcast news models were trained for the so-called eval97 task, in which broadcast news had to be transcribed. This explains why the models don't really work well for tasks like dictation: the data were simply not meant for dictation. Commercial speech applications also require a lot of specific tuning and application engineering. For example, most commercial dictation engines use more carefully processed training material to train the acoustic model and language model. They also apply techniques such as speaker adaptation. CMU, very unfortunately, doesn't have enough resources to carry out this research.


    Maintained by Evandro B. Gouvêa
    Last modified: Wed Jul 26 13:34:28 Eastern Daylight Time 2006