(This is under construction.)

  1. Miscellaneous



Generating lattices:

Lattices can be generated by including the flag

-outlatdir [directory in which you want to write lattices]
in the arguments of the s3decode binary. For each utterance, the decoder will then write a lattice file in the directory you have specified. The lattice file is named after the utterance and is compressed; its contents can be viewed by giving the command "zcat filename" on the command line of a Unix machine.

If the utterance names in the ctl file include directory names, you can keep those directory names in the lattice filenames by appending the string ,CTL to the argument you give to -outlatdir, or drop them by leaving the string out. The string is appended directly after the argument, without a space. Thus if the argument you give is

-outlatdir current
the extended argument would be
-outlatdir current,CTL
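
The effect of the ,CTL suffix can be sketched as follows. This is only an illustration of the behaviour described above, not the decoder's code: the function name lattice_stem and the utterance name are hypothetical, and the filename extension the decoder adds is deliberately left out.

```python
import os

def lattice_stem(outlatdir_arg, utt_name):
    """Return the lattice path, without extension, for one utterance.

    Hypothetical helper: it mimics the documented effect of ",CTL",
    it is not taken from the s3decode sources.
    """
    suffix = ",CTL"
    keep_dirs = outlatdir_arg.endswith(suffix)
    base_dir = outlatdir_arg[:-len(suffix)] if keep_dirs else outlatdir_arg
    # Without ,CTL only the last component of the utterance name is used.
    name = utt_name if keep_dirs else os.path.basename(utt_name)
    return os.path.join(base_dir, name)

with_dirs = lattice_stem("current,CTL", "spk1/utt01")
without_dirs = lattice_stem("current", "spk1/utt01")
print(with_dirs)
print(without_dirs)
```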

Generating N-best lists from lattices:

N-best lists can be generated from the lattices by using the binary s3astar. It works just like the decoder: it takes the same control file as the decoder, and its inlatdir is the same as the outlatdir that the decoder used. You additionally need to provide an nbestdir, where the N-best files are written. The number of hypotheses in any N-best list can be specified using the -nbest argument. The default value is 200, but asking for 200 hypotheses does not guarantee that you will get 200: if the lattice holds fewer than 200 possible hypotheses, you will get fewer. The N-best files look like matchseg outputs.

Example of a lattice and explanation of format:

The lattice has three distinct sections. The first lists all the nodes in the graph with their associated words and their begin and end times. The second lists the acoustic scores associated with each of the nodes. The final section lists the score associated with the edge between any two words. The lattice also has additional lines of information giving the total number of nodes in the graph and the ids of the first and last nodes, along with text describing the format of the lines in the lattice. In addition, the lattice may contain lines that begin with a "#". These are comments.

Here are examples from each of the components of the lattice. Explanations are interspersed.

0 ++GARBAGE++ 254 256 256
1 ++LAUGH++ 254 256 256
2 ++N++ 254 256 256
3 ++GARBAGE++ 253 255 255
4 ++LAUGH++ 253 255 255
5 ++N++ 253 255 255
6 ++GARBAGE++ 252 254 254
7 ++LAUGH++ 252 254 254
8 ++N++ 252 254 254
9 ++GARBAGE++ 251 253 253
10 ++N++ 251 253 253
11 A 251 253 253
12 ++GARBAGE++ 250 252 252
13 ++N++ 250 252 252
14 A 250 252 252
15 ++N++ 249 251 251
16 HAVE 245 250 253
17 ARE(2) 245 250 250
18 HAVE 244 249 249
19 GO 244 249 251

Node number 16 is the word HAVE; it begins at frame 245 and can end anywhere between frames 250 and 253.
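
A node line like the ones above can be read into a small record. The following is a minimal sketch that assumes the five whitespace-separated fields are node id, word, start frame, earliest end frame and latest end frame, as in the explanation of node 16; the field names are invented for the example.

```python
from typing import NamedTuple

class Node(NamedTuple):
    node_id: int
    word: str
    start_frame: int
    first_end_frame: int   # earliest frame at which the word can end
    last_end_frame: int    # latest frame at which the word can end

def parse_node_line(line):
    """Parse one line of the node section into a Node record."""
    fields = line.split()
    return Node(int(fields[0]), fields[1],
                int(fields[2]), int(fields[3]), int(fields[4]))

node = parse_node_line("16 HAVE 245 250 253")
# Number of distinct frames at which this node may end.
num_end_frames = node.last_end_frame - node.first_end_frame + 1
print(node.word, node.start_frame, num_end_frames)  # HAVE 245 4
```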

Initial 1948
Final 82

Nodes are written out in *reverse* order in the lattice. As a result, the node that is written out last is actually the *first* node in the lattice. Nodes are also not written in strictly reverse sequential order: because of the "stretch" in the ending frames of different nodes, it is difficult to determine a precise sequence for all but the first node. As a result, in this lattice, the first node is node number 1948 (the one written out last), but the last node is actually node 82.
1948 2 -172014
1948 3 -207858
1948 4 -220188
1947 5 -351673

While a node can end at many different frames, the acoustic score associated with the node when it ends at a particular frame will be different from that associated with it when it ends at a different frame. This portion of the lattice shows this information. In this example, when node number 1948 ends at frame 2 it has an acoustic score of -172014, when it ends at frame 3 the acoustic score is -207858, and so on. Note, however, that this acoustic score is only the best score and is not really useful, since the true score for the node depends on the path being considered, due to the existence of cross-word triphones.
33 23 -243293
33 20 -297751
35 23 -1599007
37 23 -1923161

The true acoustic score for any word is dependent on the word following it in the path. We therefore associate this score with the *edge* leading from that word to the following word. There can be many edges leading out of a node even at a given frame. Each of these edges is likely to have a different score than the other edges. In the above portion of the lattice we are given the information that the edge from node 33 to node 23 has the score -243293, the edge from node 33 to node 20 has the score -297751, and so on. Keep in mind that there can be only one edge between any two nodes, even though a node can end at many different frames. This is because only one of these possible ending frames will permit a proper edge to the unique starting frame of the next word.
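
The edge section maps naturally onto a dictionary of successors. The sketch below assumes each edge line holds a source node id, a destination node id and a score, and it enforces the one-edge-per-node-pair property described above; the function name is hypothetical.

```python
from collections import defaultdict

def parse_edges(lines):
    """Build {source_node: {destination_node: score}} from edge lines."""
    successors = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        src_text, dst_text, score_text = line.split()
        src, dst = int(src_text), int(dst_text)
        if dst in successors[src]:
            # The format allows only one edge between any two nodes.
            raise ValueError("duplicate edge %d -> %d" % (src, dst))
        successors[src][dst] = int(score_text)
    return successors

edges = parse_edges([
    "33 23 -243293",
    "33 20 -297751",
    "35 23 -1599007",
    "37 23 -1923161",
])
print(edges[33])
```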
1948 1440 -2083713
1948 1399 -220188

The lattice ends here.

Note also that a lattice is actually a *tree*, so the left context of any node is fixed: any node in a tree can have only one predecessor, and the variations in the acoustic scores of words are therefore due only to the right contexts. However, what sphinx3 writes out is not a lattice but a DAG, or directed acyclic graph. Nodes representing the same word in the lattice are merged if they have identical time stamps. What you see in the "lattice" file is actually this DAG and not a tree-structured lattice at all.
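
The merging step can be sketched as follows. This is an illustration under assumptions, not the decoder's implementation: "identical time stamps" is taken to mean the same start frame and the same range of end frames, and the node list is made-up data.

```python
def merge_nodes(tree_nodes):
    """Merge tree nodes that carry the same word and the same time stamps.

    tree_nodes: list of (node_id, word, start, first_end, last_end).
    Returns (dag_nodes, old_to_new), where old_to_new maps each tree
    node id to the index of its merged node in dag_nodes.
    """
    key_to_new = {}
    old_to_new = {}
    dag_nodes = []
    for node_id, word, start, first_end, last_end in tree_nodes:
        key = (word, start, first_end, last_end)
        if key not in key_to_new:
            key_to_new[key] = len(dag_nodes)
            dag_nodes.append(key)
        old_to_new[node_id] = key_to_new[key]
    return dag_nodes, old_to_new

# Hypothetical tree: two paths that both contain AND at the same times.
tree_nodes = [
    (0, "OK", 0, 10, 12),
    (1, "BIT", 0, 10, 12),
    (2, "AND", 13, 20, 22),
    (3, "AND", 13, 20, 22),
    (4, "THE", 23, 30, 30),
]
dag_nodes, old_to_new = merge_nodes(tree_nodes)
print(len(tree_nodes), "tree nodes ->", len(dag_nodes), "DAG nodes")
```

After merging, the AND node has two predecessors (OK and BIT), which is exactly what a tree does not allow.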

An important consideration in combining lattices from different sources:
If you had two parallel paths of this kind:

   ......> WORD1 ------> WORD2
   ......> WORD1 ------> WORD2
(say WORD1 = "and" and WORD2 = "the")

You CAN merge it to

   .....> WORD1 ------> WORD2

*If* you are using CI models! Then the two parallel edges (the dashed edges) would have had close to identical scores, so you could just take the highest score. But if you are using CD models, here is what will happen: the edges from path 1 and path 2 will have *different* scores in the lattice *even* if both WORD1 and WORD2 begin and end at exactly the same time instants in both cases. This is because the word preceding WORD1 would have been different in the two cases, so the cross-word triphone score of the first phone in WORD1 would have been different, e.g.

   OK......> WORD1(and) ------> WORD2(the)
   BIT......> WORD1(and) ------> WORD2(the)
The word preceding "and" in the first path is "ok", so the cross-word triphone at the beginning of "and" in the first path is A(EY,N). The preceding word in path 2 is "bit", so the cross-word triphone at the beginning of "and" is A(T,N). The score of the edge between "and" and "the" reflects this and differs between the two paths. All this holds even when you are working with only a *single* lattice (e.g. the MFC lattice). Any heuristic, such as using the highest score for the merged path (nodes/edges), is likely to backfire for this reason (but this would have to be tested experimentally). If path 1 is (say) from an MFC-based lattice and path 2 is (say) from a PLP-based one, this problem is compounded by the additional problem of how to come up with correct scaling factors for the scores.
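
For reference, the highest-score heuristic for merging parallel edges looks like the sketch below. It is only reasonable in the CI case, for the reasons given above; the function name is hypothetical and the edge scores are reused from the lattice example purely as sample numbers.

```python
def merge_parallel_edges(edges):
    """Collapse parallel edges, keeping the highest score per node pair.

    edges: iterable of (source, destination, score) with log-likelihood
    scores, so "highest" means least negative.
    """
    best = {}
    for src, dst, score in edges:
        key = (src, dst)
        if key not in best or score > best[key]:
            best[key] = score
    return best

parallel = [
    ("and", "the", -243293),  # sample score for the edge on path 1
    ("and", "the", -297751),  # sample score for the edge on path 2
]
merged = merge_parallel_edges(parallel)
print(merged)
```

With CD models the two scores can differ widely because of the different left contexts, so the value kept here may not belong to the path that is finally chosen.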


At the top of an ARPA-format LM are the lines:
ngram 1=NUM1
ngram 2=NUM2
ngram 3=NUM3
This means that there are NUM1 unigrams, NUM2 bigrams and NUM3 trigrams in the LM. Then you have a line

\1-grams:

This means that all following lines are unigrams until you encounter a line "\2-grams:" or a "\end\" marker. The \end\ marker marks the end of the ARPA LM. All unigrams have the form

NUMa WORD NUMb

NUMa is the log probability of the unigram for the word WORD. NUMb is the back-off weight associated with that word. For bigrams, entries may be

NUMa WORD1 WORD2 NUMb
NUMa WORD1 WORD2

The first form of entry is used when the LM also has trigrams. If it is only a bigram LM, the entries will be of the second form. Here, NUMa is the log probability of the bigram P(WORD2 | WORD1) and NUMb is the back-off weight for the word pair (WORD1 WORD2). The general N-gram entry is of the form

NUMa WORD1 WORD2 ... WORDN NUMb

or, if it is only an N-gram model (that is, N is the highest order in the LM),

NUMa WORD1 WORD2 ... WORDN

All logarithms are base 10.

To prune the LM you can delete all N-gram entries where the difference between the probability entry for that N-gram and the predicted probability for the N-gram obtained by backing off is very small. The predicted probability is, for trigrams, P(C|A,B) ~ P(C|B) * backoffwt(A,B), and for bigrams, P(C|B) ~ P(C) * backoffwt(B). Pruning is most easily done only on the highest-order N-grams, since deleting a lower-order N-gram also deletes the back-off weight stored with it and so affects our prediction for the higher-order N-grams. For example, if we pruned P(B|A) out of the LM, then backoffwt(A,B) would also get pruned out. This affects the estimate of P(C|A,B), and the pruning heuristic would have to take this into account.
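
The trigram pruning test can be sketched as below. Since the LM stores base-10 logs, the product P(C|B) * backoffwt(A,B) becomes a sum. The function name, the threshold and all the numbers are assumptions made up for the illustration, and a trigram whose back-off bigram is missing is simply kept rather than backed off further.

```python
def prune_trigrams(trigram_logp, bigram_logp, bigram_bow, threshold):
    """Drop trigrams that the back-off estimate already predicts well.

    trigram_logp: {(A, B, C): log10 P(C|A,B)}
    bigram_logp:  {(B, C): log10 P(C|B)}
    bigram_bow:   {(A, B): log10 backoffwt(A,B)}
    threshold:    maximum |difference| in log10 units for pruning
    """
    kept = {}
    for (a, b, c), logp in trigram_logp.items():
        if (b, c) in bigram_logp and (a, b) in bigram_bow:
            # log10 of P(C|B) * backoffwt(A,B)
            predicted = bigram_logp[(b, c)] + bigram_bow[(a, b)]
            if abs(logp - predicted) < threshold:
                continue  # back-off prediction is close enough: prune
        kept[(a, b, c)] = logp
    return kept

# Made-up log10 values, for illustration only.
trigram_logp = {("show", "the", "dates"): -0.50,
                ("for", "all", "deployed"): -1.20}
bigram_logp = {("the", "dates"): -0.90, ("all", "deployed"): -1.00}
bigram_bow = {("show", "the"): -0.10, ("for", "all"): -0.19}

kept = prune_trigrams(trigram_logp, bigram_logp, bigram_bow, threshold=0.05)
print(sorted(kept))
```

Only trigrams are touched here, so every back-off weight the remaining entries rely on stays in the LM.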


Matchseg files can be generated by including the flag -matchsegfn [filename] in the arguments of the decoder binary. An example entry:
SB01 \
    S 36683016 \
    T -27154407 \
    A -21711975 \
    L -858796 \
    0 -603268 0 <s> \
    17 -814154 -49595 SHOW \
    40 -594450 -11806 THE(3) \
    51 -463637 -23023 <sil> \
    65 -2880525 -88849 ITS(2) \
    133 -1203392 -72333 DATES \
    171 -806587 -5753 FOR \
    185 -898344 -26773 ALL \
    210 -2459017 -71603 DEPLOYED \
    260 -1765774 -94176 CEP \
    302 -1791387 -76622 EVERETT(2) \
    338 -843218 -62384 THEIR \
    355 -848925 -17748 HOME \
    376 -1482528 -2325 PORT \
    411 -1058809 -79875 SPEED \
    435 -786647 -37608 BY \
    453 -1771960 -32704 FIVE \
    482 -363323 -23023 <sil> \
    489 -276030 -82596 </s> \
    514
This is a hypothesis in the "matchseg format". This output usually comes on a single line, but it was rearranged above to make it easier to read. The first word (or "field") is the filename. S is a scaling factor (it keeps integers from wrapping around due to underflow, and can be thought of as a normalization factor for likelihoods). T is the total likelihood of the utterance, A is the acoustic likelihood and L is the LM likelihood (these are all log likelihoods, hence the large numbers). From then onwards the format is

beginning_frame_number acoustic_score lm_score WORD beginning_frame_number acoustic_score lm_score WORD ...

and so on. At the end, the LAST frame number of the utterance is written (514 in this case).
The hypothesis itself is the string of all words between <s> and </s>. (The hypothesis combination program needs two or more such matchseg files [in the same order] and outputs a matchseg file which is the best path hypothesis in the graph constructed from the input matchseg files.)
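
A matchseg line can be split into its fields as sketched below. The parser assumes the layout described above (filename, the S/T/A/L pairs, then groups of four fields, then the last frame number); the function name is hypothetical. In this example the per-word acoustic and LM scores add up to the A and L values in the header.

```python
def parse_matchseg(line):
    """Split one matchseg line into (utt_id, header, segments, last_frame)."""
    tokens = line.split()
    utt_id = tokens[0]
    # Header pairs: S <scale> T <total> A <acoustic> L <lm>
    header = {tokens[i]: int(tokens[i + 1]) for i in range(1, 9, 2)}
    body = tokens[9:]
    last_frame = int(body[-1])  # assumes the trailing frame number is present
    segments = []
    for i in range(0, len(body) - 1, 4):
        start, ascr, lscr, word = body[i:i + 4]
        segments.append((int(start), int(ascr), int(lscr), word))
    return utt_id, header, segments, last_frame

line = (
    "SB01 S 36683016 T -27154407 A -21711975 L -858796 "
    "0 -603268 0 <s> 17 -814154 -49595 SHOW 40 -594450 -11806 THE(3) "
    "51 -463637 -23023 <sil> 65 -2880525 -88849 ITS(2) "
    "133 -1203392 -72333 DATES 171 -806587 -5753 FOR 185 -898344 -26773 ALL "
    "210 -2459017 -71603 DEPLOYED 260 -1765774 -94176 CEP "
    "302 -1791387 -76622 EVERETT(2) 338 -843218 -62384 THEIR "
    "355 -848925 -17748 HOME 376 -1482528 -2325 PORT 411 -1058809 -79875 SPEED "
    "435 -786647 -37608 BY 453 -1771960 -32704 FIVE 482 -363323 -23023 <sil> "
    "489 -276030 -82596 </s> 514"
)
utt_id, header, segments, last_frame = parse_matchseg(line)
# The hypothesis is everything between <s> and </s>.
words = [w for _, _, _, w in segments if w not in ("<s>", "</s>")]
acoustic_sum = sum(ascr for _, ascr, _, _ in segments)
lm_sum = sum(lscr for _, _, lscr, _ in segments)
print(utt_id, last_frame, " ".join(words))
```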



-compress : compress excess background frames
-compressprior : compress excess background frames based on prior utt
Typical silence compression code is as follows:

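    if (silcomp == COMPRESS_PRIOR) {
        j = 0;
        for (i = 0; i < nfr; i++) {
            if (histo_add_c0 (mfc[i][0])) {
                /* Frame is kept: move it down over any deleted frames */
                if (i != j)
                    memcpy (mfc[j], mfc[i], sizeof(float)*CEP_SIZE);

                /* Remember which raw frame this compressed frame came from */
                comp2rawfr[j++] = i;
            }
            /* Else skip the frame, don't copy across */
        }
        nfr = j;    /* number of frames left after compression */
    }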

The "silence" frames are actually deleted.

Note: This is not good when you are using models trained in the standard manner using SPHINX-III. Deleting silence frames completely during decoding (regardless of whether they are put back in the seg file later) is bad. We explicitly train cross-word triphones with silence as context, and there are usually hundreds of such triphones in the model set. If there are no silence frames at all in the sequence of frames being decoded, the cross-word triphones with silence context never get a chance to be used. Note that in most model sets the silence and breath models are usually the best-trained models.

last modified: 22 Nov. 2000