Sphinx Knowledge Base Tool


The lmtool builds a consistent set of lexical and language model files for decoders. The target decoders are the Sphinx family, though any system that can read ARPA-format files can use them.
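For readers who want to check a language model file before handing it to a decoder, here is a minimal sketch in Python that reads the n-gram counts from the header of an ARPA-format file. It assumes the usual text layout (a \data\ line followed by "ngram N=COUNT" lines); the function name read_ngram_counts and the sample lines are hypothetical and are not part of lmtool.

```python
def read_ngram_counts(lines):
    """Return {order: count} from the \\data\\ header of an ARPA-format model.

    Assumes the common text layout: a line "\\data\\", then lines such as
    "ngram 1=4", ending at the next line that starts with a backslash.
    """
    counts = {}
    in_data = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if not in_data:
            continue
        if line.startswith("ngram "):
            order, _, count = line[len("ngram "):].partition("=")
            counts[int(order)] = int(count)
        elif line.startswith("\\"):
            # Reached the first n-gram section (e.g. "\1-grams:"); header is done.
            break
    return counts


if __name__ == "__main__":
    # Illustrative header only; these numbers are made up for the example.
    sample = ["\\data\\", "ngram 1=4", "ngram 2=5", "", "\\1-grams:"]
    print(read_ngram_counts(sample))
```

In practice you would pass an open file object instead of a list, since the function only iterates over lines.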

Currently lmtool is configured for the English language (and its American dialect in particular). If you upload a corpus in a different language, the output is likely to be unpredictable. We are working on this. The current version also does not deal gracefully with Unicode; this is being worked on as well.

FIRST, CREATE A CORPUS FILE consisting of all sentences you would like the decoder to recognize. Put one sentence on each line, without punctuation symbols. You may not need to exhaustively list all possible sentences: the decoder will allow fragments to recombine into new sentences, but the sentences you provide will be preferred. For example:

	THIS IS AN EXAMPLE SENTENCE
	EACH LINE IS SOMETHING THAT YOU'D WANT YOUR SYSTEM TO RECOGNIZE
	ACRONYMS PRONOUNCED AS LETTERS ARE BEST ENTERED AS A T_L_A
	NUMBERS AND ABBREVIATIONS OUGHT TO BE SPELLED OUT FOR EXAMPLE
	TWO HUNDRED SIXTY THREE ET CETERA
	YOU CAN UPLOAD A FEW THOUSAND SENTENCES
	BUT THERE IS A LIMIT
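
If your sentences start out as ordinary text, a small script can put them into the form shown above. The following is a minimal sketch in Python, not part of lmtool: the function names normalize_line and build_corpus are hypothetical, and upper-casing and keeping apostrophes and underscores are assumptions taken from the example lines rather than stated requirements. It does not spell out numbers or abbreviations; that still has to be done by hand.

```python
import re


def normalize_line(line: str) -> str:
    """Upper-case a sentence and strip punctuation.

    Apostrophes (YOU'D) and underscores (T_L_A) are kept, as in the
    example corpus. Any other character outside A-Z, 0-9, and space,
    including non-ASCII letters, is replaced by a space.
    """
    text = line.upper()
    text = re.sub(r"[^A-Z0-9'_ ]+", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def build_corpus(sentences):
    """Return cleaned corpus lines, one sentence per entry, skipping empties."""
    lines = []
    for sentence in sentences:
        cleaned = normalize_line(sentence)
        if cleaned:
            lines.append(cleaned)
    return lines


if __name__ == "__main__":
    raw = ["This is an example sentence.", "Turn the T_L_A on, please!", "   "]
    for line in build_corpus(raw):
        print(line)
```

Write the resulting lines to a plain text file, one per line, and upload that file as your corpus.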

Use the lmtool!

[26 January 2010]
Version 3 is now ready for public use. lmtool has been reorganized internally to make use of the Logios package. This will make lmtool easier to maintain in the future and will allow it to take advantage of ongoing development in Logios. These changes should be transparent to regular users. Please give it a try. If you have any problems, or discover bugs, let the maintainer know. If things look good (i.e., I stop getting bug reports) this will become the standard version.

NOTE: If you have automated the use of this tool, you will need to update your code. The main difference is that the name of the target script has changed. The old script will still be available, so nothing will break immediately, but it is unlikely to be maintained going forward. Also, file links are no longer tagged in the HTML. Please let me know if you make use of this feature and I'll find a fix.

Alex Rudnicky