The CMU Statistical Language Modeling (SLM) Toolkit
Version 2 of the toolkit is the most up-to-date version publicly available.
Version 1 of the toolkit is also available.
Note: The CMU SLM toolkit is meant for large amounts
of training data. If you intend to train a language model from a few
dozen or even a few hundred sentences, please refer to lmtool instead.
The text here is from CMU_SLM_Toolkit_V1.0_release/doc/OVERVIEW,
HTML-ized.
The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit
is a set of Unix software tools designed to facilitate language
modeling work in the research community.
Some of the tools are used to process general textual data into:
- word frequency lists and vocabularies
- word bigram and trigram counts
- vocabulary-specific word bigram and trigram counts
- bigram- and trigram-related statistics
- various Backoff bigram and trigram language models
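As a rough illustration of the kinds of data listed above, the following Python sketch derives a word frequency list, a vocabulary, and bigram and trigram counts from a toy text. It is not the toolkit's code or file format: the function name `ngram_counts`, the frequency cutoff of 2, and the `<UNK>` symbol used for out-of-vocabulary words are all assumptions made for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy training text; real use would involve far more data.
tokens = "the cat sat on the mat the cat ran".split()

# Word frequency list.
word_freq = Counter(tokens)

# Vocabulary: words seen at least twice (the cutoff is an assumption).
vocab = sorted(w for w, c in word_freq.items() if c >= 2)

# Plain word bigram and trigram counts.
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# Vocabulary-specific counts: map every out-of-vocabulary word to one
# placeholder symbol before counting (the symbol name is an assumption).
vocab_set = set(vocab)
mapped = [w if w in vocab_set else "<UNK>" for w in tokens]
vocab_bigrams = ngram_counts(mapped, 2)

print(word_freq.most_common(2))
print(vocab)
print(bigrams[("the", "cat")], vocab_bigrams[("cat", "<UNK>")])
```

Mapping out-of-vocabulary words to a single symbol before counting is one common way to keep count tables bounded by the vocabulary size; the toolkit itself may handle this differently.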
Others use the resulting language models to compute:
- perplexity
- Out-Of-Vocabulary (OOV) rate
- bigram- and trigram-hit ratios
- distribution of Backoff cases
- annotation of test data with language scores
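Some of the evaluation measures listed above can be sketched in a few lines of Python. This is a minimal sketch using the textbook definitions, not the toolkit's implementation: the function names are hypothetical, the per-word probabilities are assumed to come from some already trained model, and the toolkit's exact conventions (for example, how OOV words enter the perplexity computation) may differ.

```python
import math

def oov_rate(test_tokens, vocab):
    """Fraction of test tokens that are not in the vocabulary."""
    return sum(1 for w in test_tokens if w not in vocab) / len(test_tokens)

def perplexity(probs):
    """Perplexity from the model probability of each test word given its
    history: exp of the negative mean log probability."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

def bigram_hit_ratio(test_tokens, seen_bigrams):
    """Fraction of adjacent test word pairs that were seen in training."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    return sum(1 for p in pairs if p in seen_bigrams) / len(pairs)

# Toy inputs, invented for illustration only.
vocab = {"the", "cat", "sat"}
seen_bigrams = {("the", "cat"), ("cat", "sat")}
test_tokens = "the cat sat on the mat".split()

print(oov_rate(test_tokens, vocab))                  # 2 of 6 tokens are OOV
print(bigram_hit_ratio(test_tokens, seen_bigrams))   # 2 of 5 pairs were seen
print(perplexity([0.25, 0.25, 0.25, 0.25]))          # close to 4 for a uniform 1-in-4 model
```

A model that assigns every word probability 1/k has perplexity k, which is why perplexity is often read as an effective branching factor.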
Future versions may include support for other modeling schemes, such
as Deleted Interpolation, and for adaptive language modeling
(e.g. caches).
In addition to their primary usage, the tools are also meant to be
used as building blocks for new experimental language models.
Contact Roni Rosenfeld about the SLM.
Page maintained by sphinx+web.