The CMU Sphinx Group Open Source Speech Recognition Engines

CMU Sphinx documentation Wiki

Training an acoustic model with LDA and MLLT feature transforms

Attention! This feature is for single-stream (i.e. SphinxThree) models only. It will not work for Sphinx2 or PocketSphinx. Sorry about that folks...

A recently added feature of SphinxTrain is the ability to train feature-space transformations for acoustic models. There are a couple of benefits to using these. First of all, they can dramatically reduce the word error rate (up to 25% relative in some of our tests). Second, they make the decoder slightly faster, since they reduce the dimensionality of the features, and they also reduce the size of the acoustic model.

Unfortunately, the training process becomes a bit more involved when using this feature, because some parts of training have to be done several times over. Specifically, you have to train a basic model in order to train each feature transformation, then retrain the model with the transformation applied to the input features. This has to be done for each feature transformation (currently there are two of them, LDA and MLLT, which have been found to have additive effects).

These feature transformations are "discriminative" in the sense that they try to improve the separability of acoustic classes in the feature space. This means that it's necessary to define the set of acoustic classes on which they are trained. There are two obvious choices, which both seem to work well. The simplest and quickest to train is the set of context-independent phonemes. The more involved one is the set of context-dependent tied triphone states (senones). In both cases, the SphinxTrain scripts try to automate the whole process for you.
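To make the idea of class separability concrete, here is a minimal numpy sketch of textbook LDA. It is not SphinxTrain's implementation; the function name lda_transform and its arguments are made up for illustration, and it assumes the within-class scatter matrix is invertible. Each row of features is one frame and labels holds that frame's acoustic class (for example, a context-independent phone index).

```python
import numpy as np

def lda_transform(features, labels, out_dim):
    """Estimate an (out_dim x in_dim) LDA projection from labelled frames."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    in_dim = features.shape[1]
    global_mean = features.mean(axis=0)
    within = np.zeros((in_dim, in_dim))   # scatter of frames around their class mean
    between = np.zeros((in_dim, in_dim))  # scatter of class means around the global mean
    for cls in np.unique(labels):
        frames = features[labels == cls]
        mean = frames.mean(axis=0)
        centered = frames - mean
        within += centered.T @ centered
        offset = (mean - global_mean)[:, None]
        between += len(frames) * (offset @ offset.T)
    # Directions that maximise between-class relative to within-class scatter
    # are the leading eigenvectors of inv(within) @ between.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(within, between))
    order = np.argsort(eigvals.real)[::-1][:out_dim]
    return eigvecs.real[:, order].T
```

Projecting the frames with features @ lda_transform(features, labels, k).T keeps the k directions along which the classes are best separated, which is why the transform can reduce dimensionality and error rate at the same time.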

Required software components

First, you need to have the necessary Python modules installed in order to do LDA and MLLT. You should (obviously) also have Python 2.3 or newer. To make sure that you have the necessary modules installed, make sure that you can run Python and load the numpy and scipy.optimize modules:

dhuggins@lima:~$ python
Python 2.5.1 (r251:54863, Oct  5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import scipy
>>> import scipy.optimize
>>>
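
The same check can be scripted. The sketch below is written for a current Python 3 interpreter, which is an assumption (the session above shows Python 2.5), and the helper name missing_modules is made up for illustration. It only checks that the modules can be located; it does not verify their versions.

```python
import importlib.util

REQUIRED_MODULES = ("numpy", "scipy", "scipy.optimize")

def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when the parent package (e.g. scipy) is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing

if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("numpy and scipy.optimize are available")
```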

Configuration changes

Once you're sure that Python is working, you need to make sure that you are using the most recent version of SphinxTrain (either a Subversion check-out or the nightly tarball) and that the Python modules in your copy of SphinxTrain have been built. To do this, run python setup.py build in the python directory of SphinxTrain.

You should also make sure that the SphinxTrain scripts are up to date in your training directory. The easiest way to do this is simply to blow away the scripts_pl and python directories and run setup_SphinxTrain.pl from the SphinxTrain scripts_pl directory.

Finally, you need to turn on LDA and MLLT in your sphinx_train.cfg file. To do this make sure it contains two lines reading:

$CFG_LDA_MLLT = 'yes';
$CFG_LDA_DIMENSION = 29;

You can adjust $CFG_LDA_DIMENSION if you like, though 29 or 30 seems to be a nearly-optimal value for many data sets.
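
If you want to confirm the settings before starting a long training run, a small script can scan the config text. This is only a sketch: the helper names are hypothetical, it is not part of SphinxTrain, and it assumes the settings are written as simple one-line `$CFG_NAME = value;` assignments like the two shown above.

```python
import re

def read_cfg_settings(cfg_text):
    """Collect simple `$CFG_NAME = value;` assignments from Perl config text."""
    pattern = re.compile(
        r"^\s*\$(CFG_\w+)\s*=\s*['\"]?([^'\";]*)['\"]?\s*;", re.MULTILINE
    )
    return {name: value.strip() for name, value in pattern.findall(cfg_text)}

def lda_mllt_enabled(cfg_text):
    """True if LDA/MLLT is switched on and a numeric dimension is set."""
    settings = read_cfg_settings(cfg_text)
    return (
        settings.get("CFG_LDA_MLLT") == "yes"
        and settings.get("CFG_LDA_DIMENSION", "").isdigit()
    )
```

You would pass it the contents of your sphinx_train.cfg, for example lda_mllt_enabled(open(path).read()), where path points to wherever that file lives in your training directory.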

Decoding with the MLLT model

First, you need to update your decoding scripts.

If you are using the released version of SphinxThree (sphinx3.7), then you will also have to add the MLLT transformation file to the command-line. If your acoustic model is called rm, then the file will be called rm.mllt, and it lives inside the model_parameters folder. So, for a model named rm, you would add something like this to your command-line:

-lda model_parameters/rm.mllt

Expected results using MLLT

Using MLLT, you can hope for roughly a 25% relative improvement, that is, about a quarter of the remaining errors removed. For example, if you had 70% accuracy:

  • 70 + (100 - 70) * 0.25 = 77.5%. Accuracy would rise by 7.5 percentage points, from 70% to 77.5%.
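
The same arithmetic, written out as a small function so you can plug in your own baseline. The function name is made up for illustration, and the 25% figure is the hoped-for value quoted above, not a guarantee.

```python
def accuracy_after_relative_gain(accuracy_pct, relative_error_reduction):
    """Apply a relative error reduction to a baseline accuracy (in percent)."""
    error_pct = 100.0 - accuracy_pct
    return accuracy_pct + error_pct * relative_error_reduction

print(accuracy_after_relative_gain(70.0, 0.25))  # 77.5
```

Note that the absolute gain shrinks as the baseline improves: the same 25% relative reduction applied to a 90% baseline yields 2.5 percentage points.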

Cepstral Window Features

It's possible to use LDA to bypass the feature extraction step, which is itself a linear transform. In theory this could give some improvement in accuracy. To do that with the latest trunk, you need to:

  • Set feature type to 1s_c
  • Add $CFG_FEAT_WINDOW=3; to the config file
  • Train with MLLT
  • Apply the attached patch, cepwin.diff, to sphinxbase
  • Decode

You can now use these models in Sphinx4; the following config should do the job:

<component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.ConcatFeatureExtractor">
<property name="windowSize" value="3"/>
</component>
<component name="lda" type="edu.cmu.sphinx.frontend.feature.LDA">
<property name="loader" value="sphinx3Loader"/>
</component>

There is no recommendation for the optimal parameters yet, but it seems that something like cepwin=3 and a final dimension of around 40 should work. We hope to get results on this soon.
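
For intuition, the sketch below shows what stacking a window of cepstral frames looks like before LDA reduces the dimension. It rests on two assumptions that should be checked against the patch and the Sphinx4 component: that the window value counts frames on each side of the current one, and that the first and last frames are repeated at the edges. The function name is hypothetical.

```python
def stack_window(frames, window):
    """Concatenate each frame with `window` neighbours on either side."""
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vector = []
        for offset in range(-window, window + 1):
            # Clamp at the edges so the first and last frames are repeated.
            neighbour = min(max(t + offset, 0), last)
            vector.extend(frames[neighbour])
        stacked.append(vector)
    return stacked
```

Under these assumptions, 13 cepstral coefficients per frame and a window of 3 give 7 * 13 = 91 values per stacked vector, which the LDA transform would then project down to the final dimension.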

LDAMLLT (last edited 2009-05-01 19:37:25 by NickolayShmyrev)

This page is maintained by David Huggins-Daines
CMUSphinx is a project within the Sphinx Group at Carnegie Mellon