Statistical Language Modeling Toolkit

The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models. The SLM toolkit is intended for large amounts of training data. If you intend to train a language model from a few dozen or even a few hundred sentences, please refer to the lmtool.

Version 1 was written by Roni Rosenfeld at Carnegie Mellon University.

The toolkit has since been rewritten by Philip Clarkson and Roni Rosenfeld, and now provides increased functionality and efficiency. Version 2 is no longer limited to bigram and trigram models, and supports n-grams of arbitrary size. It also supports several discounting schemes, rather than limiting the user to the Good-Turing discounting strategy used in version 1. In addition, the tools used to count word n-grams, vocabulary n-grams and id n-grams have been rewritten to run considerably faster. Other changes include a more flexible way of handling context cues, the ability to calculate probabilities from ARPA-format language models, the ability to force the model to back off under certain circumstances (for example, if there is an unknown word in the context), and support for gzip-compressed files as well as files compressed with the compress utility.
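
For readers unfamiliar with the term, the short Python sketch below illustrates what counting word n-grams of arbitrary size means. It is not the toolkit's code and does not reflect how its tools are implemented; the function name count_ngrams and the sample sentence are made up for illustration.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper, not a toolkit API)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zipping n staggered views of the token list yields each n-gram as a tuple;
    # zip stops at the shortest view, so inputs shorter than n produce no n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Illustrative input only: whitespace tokenization of a made-up sentence.
tokens = "the cat sat on the mat the cat slept".split()

bigrams = count_ngrams(tokens, 2)
fourgrams = count_ngrams(tokens, 4)

print(bigrams[("the", "cat")])   # number of times "the cat" occurs
print(sum(fourgrams.values()))   # total number of 4-grams in the sample
```

The same function handles any n, which is the sense in which a model is no longer tied to bigrams and trigrams. A real toolkit would also need to deal with vocabulary mapping, context cues and data too large to hold in memory, none of which this sketch attempts.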

Please note that as of June 1999, I am no longer at Cambridge University, and am therefore unable to provide a great deal of support for the toolkit. I will try to provide answers to any quick questions that come up, however. The e-mail address at the foot of this page should continue to work for the foreseeable future.


Philip Clarkson - prc14@eng.cam.ac.uk

Last modified 7th June 1999