TRAINING ACOUSTIC MODELS USING THE SPHINX-III SYSTEM


This part of the manual is being written to enable people who are unfamiliar with the science of speech recognition to understand and fully utilize the potential of the Sphinx. As a result, the discussion may not always be completely rigorous. If you find any part of this discussion to be incomplete, unclear or objectionable, we would appreciate it if you would write to us and let us know. You may send your comments to rsingh@cs.cmu.edu or mseltzer@cs.cmu.edu

Every speech recognition "system" consists of two parts: a "trainer" and a "decoder". The trainer learns many things about the speech examples that you give to it, and encapsulates all that knowledge in a handful of files which are called "acoustic models". Once you have these files, your "system" is ready to do the rest of its job: to recognize any new speech that you give to it. This job is carried out by the decoder, which uses the acoustic models (and some *more* knowledge about your language, given to it in the form of a single file called the "language model", which is created by what are called "language modeling tools") to guess what was said, and gives you its guess in the form of written words. These "written words" are called "hypotheses".

Acoustic models can be of many kinds. The kind learned and used by the Sphinx is called the "Hidden Markov Model", or "HMM". We will explain HMMs in greater detail in a later section. For now we will use the word "HMMs" interchangeably with "acoustic models".

The first thing that you must keep in mind is that any system like this can only perform as well as it has been taught, or worse. If you want it to recognize Swahili, then you must teach it Swahili by giving it examples of Swahili. The more examples you give, the better it learns, and the better it is capable of performing.

USING THE SPHINX TRAINER TO TRAIN ACOUSTIC MODELS: THE FIRST STEP

Just as we learn the alphabet as a first step towards learning to read a language, the speech recognition system must be taught to recognize "units" of sound. These units are small or big chunks of sound that occur again and again when we speak, and sound more or less similar every time they occur. These are called "sound units", or "acoustic units".

The Sphinx trainer learns acoustic models for *prespecified* acoustic units occurring in a given set of acoustic data. In order to do this, the words and sounds that constitute every utterance in the corpus of acoustic data must first be written down in a SINGLE file. This is called the "transcript" file. For example, if you recorded the utterance "these are called transcripts", put the recording in a file called utt.example1, and this was your entire training data, then the transcript file would consist of a single line which would read:

THESE ARE CALLED TRANSCRIPTS (utt.example1)

If, while the utterance was being recorded, someone came into the room and banged the door shut, and the bang got recorded just after the word "called" was uttered, then the transcript file would read:

THESE ARE CALLED ++BANG++ TRANSCRIPTS (utt.example1)

If the corpus had a second sentence recorded in a file called utt.example2 then the corresponding transcript file for the corpus would look like

THESE ARE CALLED ++BANG++ TRANSCRIPTS (utt.example1)
SH- SHUT UP I AM RECORDING (utt.example2)

A transcript file must thus be an accurate and honest record of every sound that was recorded in the utterance, be it intelligible speech, partial words, or other sounds.
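
If you ever need to process transcript files with your own scripts, the format is easy to handle. Here is a minimal sketch in Python (the function name and the parsing approach are ours, not part of the Sphinx tools) that splits a transcript line into its word sequence and its utterance identifier:

    import re

    def parse_transcript_line(line):
        # A transcript line is a sequence of words followed by the
        # utterance identifier in parentheses, for example:
        #   THESE ARE CALLED ++BANG++ TRANSCRIPTS (utt.example1)
        match = re.match(r"^(.*)\((\S+)\)\s*$", line.strip())
        if match is None:
            raise ValueError("not a transcript line: %r" % line)
        words = match.group(1).split()
        utt_id = match.group(2)
        return words, utt_id

    words, utt_id = parse_transcript_line(
        "THESE ARE CALLED ++BANG++ TRANSCRIPTS (utt.example1)")
    # words  -> ['THESE', 'ARE', 'CALLED', '++BANG++', 'TRANSCRIPTS']
    # utt_id -> 'utt.example1'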

Then you must tell the Sphinx what acoustic units you want it to generate models for. You can do this by preparing a second ASCII text file called the "dictionary". For example, here is a dictionary for the two-utterance corpus given above:

THESE		@         
ARE 		@
CALLED		@
++BANG++	@
TRANSCRIPTS	TRANSCRIPTS
SH-		@
SHUT		@
UP		@
I 		@
AM		@
RECORDING	@

When you give this file to the Sphinx, you are telling it this:

"Ok- I am giving you a corpus. It has many words and I've listed each of them in the left column of this dictionary. Somehwere in this corpus is an acoustic example of the word TRANSCRIPTS. I want you to find out where it is and use the corresponding data to make me an HMM for this word only. I don't care about the other words. Put them all together and make me a model called @".

The Sphinx will then take the acoustic data you give it, the transcript file and the dictionary, go through a series of steps to figure out which section(s) of the corpus correspond exactly to the word TRANSCRIPTS, and estimate from the corresponding data the HMM that best fits it, giving you a model labelled TRANSCRIPTS. It will collect the rest of the data and make you an HMM labelled @.

So you would have acoustic models labelled TRANSCRIPTS and @, which you can use for recognition. This set of models would serve you well in a recognition task where all you have to do is find out whether or not the speech you encounter contains the word "TRANSCRIPTS".
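
The dictionary file is just as simple to handle in your own scripts: each line is a word, some whitespace, and the label (or sequence of labels) of the model(s) it maps to. A minimal sketch, with a function name of our own invention:

    def read_dictionary(path):
        # Map each word to the list of model labels on the right-hand
        # side of its entry. For the dictionary above, this gives e.g.
        # {'THESE': ['@'], 'TRANSCRIPTS': ['TRANSCRIPTS'], ...}
        dictionary = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:
                    dictionary[fields[0]] = fields[1:]
        return dictionary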

You can use the same corpus to estimate more detailed acoustic models. You could, for example, construct a dictionary that looks like this:

THESE		THESE         
ARE 		ARE
CALLED		CALLED
++BANG++	++BANG++
TRANSCRIPTS	TRANSCRIPTS
SH-		SH-
SHUT		SHUT
UP		UP
I 		I
AM		AM
RECORDING	RECORDING

Now you are telling the Sphinx, "I am giving you acoustic examples of all the words in the left column. I want you to find out which section(s) of the corpus correspond to each of these, and use the corresponding data to make the models listed in the right column." Remember that the right-hand column is similar to the left-hand column merely for convenience. The same models would be estimated by using the dictionary
THESE		alpha
ARE 		beta
CALLED		gamma
++BANG++	eta
TRANSCRIPTS	mod1
SH-		##
SHUT		99
UP		ASDA
I 		c
AM		XXX
RECORDING	@@

except that instead of being *labeled* as the words themselves, they would be labeled alpha, beta, gamma and so on.

In all these examples, you are modeling complete words as the units of sound. When you want to build acoustic models to recognize thousands or even tens of thousands of words, it is not advisable to use whole words as units of sound, for many reasons which we will explain in the next section. When you build acoustic models that enable you to recognize tens of thousands of words, the "system" of models you build is called a large vocabulary continuous speech recognition (LVCSR) system.

Sphinx is the tool that helps you build these systems.

ACOUSTIC UNITS: WHICH UNITS ARE BEST FOR MY TASK?

If you have a small amount of data to give to the Sphinx for training your acoustic models, then remember two things before designing your dictionary. The Sphinx builds statistical models. In statistics, anything is better estimated if you have more data to estimate it from. When there is too little data, the estimates are poor. So if you ask the Sphinx to build one model for every word, and all you have are one or two acoustic examples of each word, the Sphinx will not have enough data to estimate the models properly. So you have to somehow reduce the number of units you ask it to train models for. There is a second problem. If you train models for a certain set of words, and want to use them to recognize a different word, clearly you are making a mistake. You really cannot recognize any word that you haven't taught the system to recognize, and for which you don't have a model. What must you do to avoid this problem, and also that of data insufficiency?

Here's what you must do. Modify the dictionary to have a smaller number of acoustic units - smaller than a word, but from which every word can be easily constructed. One example is as follows:

THESE		THESE
ARE 		ARE
CALLED		KO LD
++BANG++	BANG
TRANSCRIPTS	TR AE NSK RI PTS
SH-		SH
SHUT		SH UT
UP		UP
I 		I
AM		AE M
RECORDING	RI KO RDING

Now you are telling the Sphinx to make acoustic models for 16 distinct units, rather than the 11 that you had earlier. This will land you in even more data insufficiency problems, since the data will now have to be shared among 16 models instead of 11. But then consider this: once you have these 16 models, although you never saw or trained a model for the following words, you can construct them using the existing models and actually recognize them!
RIM	        RI M
MY	        M I
ACCORDING	AE KO RDING
MUTT		M UT
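
You can check mechanically whether a new word can be constructed from units you already have models for: every unit in its proposed pronunciation must appear somewhere in the training dictionary. A small sketch (the function is our own illustration), using the units from the dictionary above:

    # The set of distinct acoustic units in the training dictionary
    # above (THESE and ARE were kept as whole-word units).
    trained_units = {
        "THESE", "ARE", "KO", "LD", "BANG", "TR", "AE", "NSK", "RI",
        "PTS", "SH", "UT", "UP", "I", "M", "RDING",
    }

    def is_constructible(pronunciation, units):
        # A new word is recognizable if every unit in its pronunciation
        # already has a trained model.
        return all(unit in units for unit in pronunciation)

    print(is_constructible(["RI", "M"], trained_units))            # RIM: True
    print(is_constructible(["AE", "KO", "RDING"], trained_units))  # ACCORDING: True
    print(is_constructible(["B", "AA", "R"], trained_units))       # BAR: False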

By now you must have realized that if you had been smarter and designed the dictionary with better units, you might have been able to recognize even more words! For example, if your dictionary had looked like this
THESE		TH EE Z
ARE 		AA R
CALLED		K AO L D
++BANG++	+BANG+
TRANSCRIPTS	T R AE N S K R I P TS
SH-		SH
SHUT		SH AX T
UP		AX P
I 		AY
AM		AE M
RECORDING	R I K AO R D ING

it would have given you models which would have enabled you to recognize the following words:
THAT 		TH AE T
CZAR		Z AA R
COT		K AO T
SHIP		SH I P
DARK		D AA R K
...
and many others

But then you have 22 units - leading to even more data insufficiency.
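
If you want to count the distinct units in a dictionary mechanically, a short self-contained sketch will do it ("mydict.txt" is a made-up file name for a dictionary in the format above):

    # Count the distinct acoustic units on the right-hand side of a
    # dictionary file. The phonetic dictionary above gives 22; the
    # earlier, coarser one gives 16.
    units = set()
    with open("mydict.txt") as f:
        for line in f:
            fields = line.split()
            if fields:
                units.update(fields[1:])
    print(len(units))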

Acoustic modeling is all about cleverly designing your specifications of the acoustic units in order to get maximum extensibility with the minimum aggravation of the data insufficiency problem. Let us now look at a better example: Suppose your transcripts were as follows:

CAT RAT BAT CAB TAB (utt.example3)
BAR STAR CAR ART CART BART (utt.example4)

If you were to model each word as a separate acoustic unit, here's the dictionary you would design:
CAT		CAT
RAT		RAT
BAT		BAT
CAB		CAB
TAB		TAB
BAR		BAR
STAR		STAR
CAR		CAR
ART		ART
CART		CART
BART		BART

The Sphinx would then give you 11 acoustic models, labeled as each word separately. The acoustic data would have been distributed among these words, and may not have been sufficient to estimate any of the models properly. To fix this problem of data insufficiency, here's what you might do:
CAT		K AE T
RAT		R AE T
BAT		B AE T
CAB		K AE B
TAB		T AE B
BAR		B AA R
STAR		S T AA R
CAR		K AA R
ART		AA R T
CART		K AA R T
BART		B AA R T

As before, you have modeled the data with acoustic units smaller than words. If you split the words into units as shown above, the Sphinx would estimate a set of 7 acoustic models for you, instead of 11, and the data available for most of these would be greater than before. The acoustic models would be better estimated, and of course, as explained earlier, the recognition vocabulary would have many more words that could be constructed out of these acoustic units.

As it turns out, even when we have a corpus of thousands of words to train with, it is always advantageous to use acoustic units which are smaller than words. The units have to be defined in a way that takes maximum advantage of the acoustic data, in the sense that each unit gets a good amount of data to train with, and at the same time results in maximum extensibility of the vocabulary. The vocabulary, incidentally, is simply the set of words you can consider for any purpose - in this context, recognition.

In the example above, if you want to recognize only the words that you have seen in the corpus, it may be advantageous to just train words as acoustic units.

The bottom line is: before you begin training, think very seriously about the dictionary. It will dictate the quality and flexibility of your recognition system.

WHAT DO THE ACOUSTIC MODELS LOOK LIKE AND WHAT HAPPENS IF THE MODELS ARE NOT PROPERLY ESTIMATED?

A smooth, symmetrical hill looks like what is called a "Gaussian distribution". A bunch of such hills (of the same or different sizes) would look like a "mixture" of Gaussian distributions. When an outline of these is sketched on a flat surface like a sheet of paper, they look like what are called "one dimensional" (Gaussian) distributions. The key word here is "distribution". Something is distributed. When distributed, the resulting pattern has the shape of these distributions.

If you took some data from segments corresponding to similar speech sounds, just as the Sphinx does after it figures out which segments of the speech signal to take them from, counted the instances of each value, and plotted these counts against the values, the plot would look like a Gaussian distribution. This is not true for all kinds of signals, but is roughly correct for a speech signal. The more numbers you consider (i.e., the more data you have), the smoother and more symmetric the outline of this hill would be. You might want to do more with the data you have, if you are curious. You might say, "I am going to sub-divide this data into many parts and see if each of these parts gives me a hill pattern". That's a good idea. It might, actually. You have now computed a "mixture Gaussian" from the data. But consider this: is this the only mixture Gaussian that you could have computed from the data? Obviously, the answer is no. You could very well have divided the data into parts different from those of the first division, and this would have given you a different "Gaussian mixture" distribution for the same data.

What you are doing now is modeling the data with a mixture of Gaussians.

But where does it stop? You could keep dividing the data into arbitrarily many parts and computing Gaussian mixture distributions for it. But consider this: if you divided it into *too* many parts, there wouldn't be enough data in any part to outline your hill for you. You'd just get a rather spiky picture. The highest number of parts into which you can divide the data, without having too few data points in each part to estimate the Gaussian properly, gives the "finest" distribution that you can model the data with. Now suppose you are sufficiently conscious of this fact, have weighed it carefully against the data you have available, and have decided that you are going to model the data with a mixture of 10 Gaussians. There are still many, many ways in which you can divide the data into 10 parts to compute the Gaussians. Only one of these ways actually results in a mixture of 10 Gaussians that "fits" the data best. Imagine the notion of "fit" in any way you want to - you will not be very far from the truth. Depending on what you imagine, you can define "fit" in any way. Mathematicians like to imagine "fit" in certain ways only, and have an "objective function" that indicates how they have imagined the notion of "fit". We will discuss this later.

For now, here is what you first need to understand. The Sphinx has a certain notion of "fit" coded into its mind. Using this, and using many procedures developed over the years by researchers, it can find for you the best way to divide your data into 10 parts, in the sense that the resulting mixture Gaussian "fits" the data you give to the Sphinx best. The sense in which it understands "fit" is called the "maximum likelihood" sense. We will discuss this too later.

So if you just gave the Sphinx your data and instructed it to model it with a mixture of 10 Gaussians, it would do what is described above, and give you the mixture of Gaussians that best "fits" your data. This mixture of 10 Gaussians is your acoustic model, the "finest" good model that you could make.
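
To make the notion of "fit" concrete, here is a toy sketch of fitting a mixture of Gaussians to one-dimensional data with the EM procedure, which improves the fit in exactly this maximum likelihood sense. This is our own illustration, not the Sphinx's training code:

    import numpy as np

    def fit_gmm_1d(x, n_mix=10, n_iter=50):
        # Toy EM for a 1-D Gaussian mixture: softly assign each data
        # point to the Gaussians (E-step), then re-estimate each
        # Gaussian from the data assigned to it (M-step).
        rng = np.random.default_rng(0)
        means = rng.choice(x, n_mix)
        variances = np.full(n_mix, np.var(x))
        weights = np.full(n_mix, 1.0 / n_mix)
        for _ in range(n_iter):
            # E-step: posterior probability of each Gaussian per point.
            d = x[:, None] - means[None, :]
            log_p = (np.log(weights) - 0.5 * np.log(2 * np.pi * variances)
                     - 0.5 * d ** 2 / variances)
            p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means and variances.
            n_k = p.sum(axis=0)
            weights = n_k / len(x)
            means = (p * x[:, None]).sum(axis=0) / n_k
            d = x[:, None] - means[None, :]
            variances = (p * d ** 2).sum(axis=0) / n_k + 1e-6
        return weights, means, variances

    # e.g. fit two clumps of data with a 2-Gaussian mixture:
    x = np.concatenate([np.random.randn(500) - 3, np.random.randn(500) + 3])
    weights, means, variances = fit_gmm_1d(x, n_mix=2)

Each pass through the loop improves (or at least does not worsen) how well the mixture fits the data, which is what fitting in the "maximum likelihood" sense means in practice.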

However, asking the Sphinx to build a mixture of 10 Gaussians from all the data corresponding to the word TRANSCRIPTS (which it would have hunted out for you first) might be asking for too little. The Sphinx can do something more than this for you. For speech recognition, modeling the data with 10 Gaussians is not enough: this kind of model would be too coarse. Consider the fact that TRA is very different from NSK, this in turn is very different from RIPTS, and that they follow each other in a sequence. You can actually get an even better picture of the data if you model TRA with a mixture of 3 Gaussians, NSK with a mixture of another 3, and RIPTS with another 3. The Sphinx can also tell how much time you are going to take to transition your TRA into NSK and then into RIPTS (depending on how fast or slowly you're speaking), which includes the information on how long you're going to stretch TRA and NSK and RIPTS. If you let it do this, then you will have a model with 9 Gaussians, 3 each modeling the three parts of your data in sequence, with information about how likely one part is to transition into another and how likely it is to just stretch out. This is a Hidden Markov Model. It is "hidden" because the information it holds is there in the sequence of data that corresponds to the word TRANSCRIPTS - you just can't see it. The Sphinx has to hunt it out for you and present it to you. The information about how likely a part is to stretch out a bit in time, as the R's and N's are doing in "TRRRRA NNNSK RRRRIPTS", is called the "self-transition probability". The information about how likely it is that you have nearly finished slurring the first part and are moving on to the next is just called the "transition probability", although it should have been called the "relay probability" or something equivalent.

Suppose I just arrange these probabilities for the model of the word TRANSCRIPTS as follows:

   parta-stays-parta   parta-goesto-partb  parta-goesto-partc
   partb-goesto-parta  partb-stays-partb   partb-goesto-partc
   partc-goesto-parta  partc-goesto-partb  partc-stays-partc                   

Let's put a 0 where the event is clearly unlikely and a cross where it may happen. We get
            x      x     0
            0      x     x
            0      0     x

This is a matrix of probabilities, and for every model you ask the Sphinx to train for you, it will also give you such a "transition probability matrix" containing the transition likelihood information. So there will definitely be one corresponding to the word TRANSCRIPTS.
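
In code, such a matrix is nothing exotic. Here is a sketch of the 3-state matrix above with made-up probability values filled in for the crosses; each row is a probability distribution, so it must sum to 1:

    import numpy as np

    # Rows are "from" states, columns are "to" states. The zeros are
    # the transitions we declared unlikely; each row sums to 1.
    transitions = np.array([
        [0.6, 0.4, 0.0],   # parta: stay, or move on to partb
        [0.0, 0.7, 0.3],   # partb: stay, or move on to partc
        [0.0, 0.0, 1.0],   # partc: stay until the word ends
    ])
    assert np.allclose(transitions.sum(axis=1), 1.0)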

Each "part" is called a "state". Thus the acoustic model consists in this example of three states, each modeled by a mixture of three gaussians, and the corresponding matrix of state transition probabilities.

Whatever you have imagined this to look like, it is correct. Some people like to imagine each state to be a circle, with three hills drawn within each circle, arrows looping from each circle to itself (denoting the self-transition) and arrows going from one circle to the next (denoting the relay transition). You might go a step further and ask, "What if I have an arrow going out from one circle, skipping the next, and going into the subsequent one?" Well, that's possible. Sometimes we do that when we speak fast. We skip some sounds. At a more subtle level, we skip some data which may be a fraction of a sound. If you tell the Sphinx that your speech is likely to have this quirk in it, it is capable of finding for you even those transition probabilities. An inevitable question follows: how do I tell this to the Sphinx? It's easy. Just make a matrix like the one shown above, and put 1's instead of x's for the events that you want the Sphinx to know are possible. For fast speech where you are likely to skip some sounds, the matrix may look like:

     1   1   1
     0   1   1
     0   0   1

This matrix is called the "Hidden Markov Model topology matrix", or just the "topology matrix".

The Sphinx is capable of computing HMMs of any topology for you, so long as it's reasonable (you CANNOT go back from NSK to TRA in TRANSCRIPTS - that's not reasonable!). The topology matrix tells the Sphinx that you want your acoustic models to have the same number of states as there are columns in it, and also tells the Sphinx what other transition information you want it to extract for you.
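
A common way to turn a topology matrix into a starting point for training is to spread probability uniformly over the allowed transitions in each row, and then let the trainer re-estimate the actual values from the data. A sketch of that initialization, under our assumption of a uniform start:

    import numpy as np

    def init_transitions(topology):
        # Give every allowed transition in a row equal probability;
        # disallowed transitions (the zeros) stay exactly zero.
        topology = np.asarray(topology, dtype=float)
        return topology / topology.sum(axis=1, keepdims=True)

    # The fast-speech topology with skips, from above:
    print(init_transitions([[1, 1, 1],
                            [0, 1, 1],
                            [0, 0, 1]]))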

You *cannot* figure out this information yourself. It involves an enormous amount of computation. The Sphinx, in this respect, is far more capable than you are.

You might argue that you do figure out this information in some way - how else would you understand speech much better than the Sphinx? Well, the truth is that you do not really know if this is the *exact perspective* in which you learn that information. This is the HMM's perspective, not yours. You don't know, and no one does yet, how and in what exact perspective the human brain figures out this information. We "model" it with HMMs - but we could have modeled it in any other way as well. Unlike the Sphinx, there are speech recognizers out there which use models that are very different from HMMs. Researchers are still trying very hard to understand this and to develop a model that sees information in a perspective closest to the human brain's. The Sphinx is still evolving, and one day may change its modeling paradigm from HMMs to something more accurate. Till that happens, HMMs are the best models researchers have come up with to date.

Another reason you recognize speech much better than the Sphinx is that you have a much better idea of the world around you, or the *context* in which things are said, than the Sphinx does. If it were merely a task of trying to identify which sounds were being said, without any reference to meaning, the Sphinx would outperform you every time.

To help you visualize an acoustic model, here's a picture with no labels.

                                                                         
         _          _         _
        | |        | |       | |
        | v        | v       | v
         _____      _____     _____
        |     |    |     |   |     |
    --->| ^^^ |--->| ^^^ |-->| ^^^ |--->
        |_____|    |_____|   |_____|
            \                  /
             \________>_______/

What happens when an acoustic model is not "good"? Well, the Gaussians may not represent the speech event that is being modeled. The model for TRANSCRIPTS may look confusingly like the model for CHANCEGIFTS or, worse, may look more like TLAMSBIFS, or even like BLIP.

A parting comment for this section: the data that the Sphinx works on is not really the string of numbers that you see in the speech signal. The Sphinx first translates that string into a sequence of sequences of numbers. Each of these sequences is about 40 numbers long, though you have a lot of control over that. Each sequence of 40 numbers is called a "feature vector", and so the hills that you have imagined are actually surreal objects which have 40 facets, or 40 dimensions, to them. These are now very complex objects to imagine, but so long as you remember this fact, working with hills and understanding the working of the Sphinx with their help is just fine. Sometimes you will just have to do what you are doing 40 times over, or look at what you are looking at through a 40-lensed instrument, that's all.
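
As a very rough sketch of what this translation looks like in code: chop the signal into short overlapping frames, and compute some fixed number of values, say 40, from each frame. The toy log-spectrum features below are our own stand-in; the features the Sphinx actually computes (mel frequency cepstra) are more involved:

    import numpy as np

    def toy_feature_vectors(signal, frame_len=400, hop=160, n_dims=40):
        # Slice the signal into overlapping frames, and reduce each
        # frame to an n_dims-long vector of log spectral energies.
        # The result: one feature vector per frame.
        features = []
        for start in range(0, len(signal) - frame_len, hop):
            frame = signal[start:start + frame_len]
            spectrum = np.abs(np.fft.rfft(frame)) ** 2
            # Collapse the spectrum into n_dims equal-width bands.
            bands = np.array_split(spectrum, n_dims)
            features.append(np.log([b.sum() + 1e-10 for b in bands]))
        return np.array(features)   # shape: (n_frames, n_dims)

    # e.g. one second of a signal sampled at 16000 samples/second:
    feats = toy_feature_vectors(np.random.randn(16000))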

WHAT THE SPHINX DOES FIRST : TRANSLATION OF SPEECH SIGNAL INTO FEATURE VECTORS

(to be written..)



last modified: 16 Nov. 2000