The lmtool builds a consistent set of lexical and language model files for decoders. The target decoders are the Sphinx family, though any system that can read ARPA-format files can use them.
Currently, lmtool is configured for English (American English in particular). If you upload a corpus in a different language, the output is likely to be unpredictable. We are working on this. The current version also does not handle Unicode gracefully; this is being worked on as well.
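If you want to check the language model file from your own code, the n-gram counts can be read from the header of an ARPA-format file. The following is a minimal sketch, not part of lmtool: the function name read_arpa_counts is hypothetical, and it assumes the usual ARPA layout in which a \data\ line is followed by "ngram N=count" lines.

```python
def read_arpa_counts(lines):
    """Return {order: count} from the header of an ARPA-format file.

    Hypothetical helper for illustration; assumes the standard layout
    where the header starts with a line containing only \\data\\.
    """
    counts = {}
    in_data = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_data = True
        elif in_data and line.startswith("ngram "):
            # Header entries look like "ngram 1=5".
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_data and line.startswith("\\"):
            # The first n-gram section (e.g. \1-grams:) ends the header.
            break
    return counts
```

It accepts any iterable of lines, so an open file object can be passed directly.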
FIRST, CREATE A CORPUS FILE consisting of all the sentences you would like the decoder to recognize. Put one sentence on each line and leave out punctuation. You may not need to list every possible sentence exhaustively: the decoder will allow fragments to recombine into new sentences, but the sentences you provide will be preferred. For example:
THIS IS AN EXAMPLE SENTENCE
EACH LINE IS SOMETHING THAT YOU'D WANT YOUR SYSTEM TO RECOGNIZE
ACRONYMS PRONOUNCED AS LETTERS ARE BEST ENTERED AS A T_L_A
NUMBERS AND ABBREVIATIONS OUGHT TO BE SPELLED OUT FOR EXAMPLE TWO HUNDRED SIXTY THREE ET CETERA
YOU CAN UPLOAD A FEW THOUSAND SENTENCES BUT THERE IS A LIMIT
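If your sentences start out as ordinary text, a small script can put them into the shape shown above. The following is a rough sketch, not part of lmtool: the names normalize_line and build_corpus are hypothetical, the uppercase output simply mirrors the example, and lines containing digits or non-ASCII characters are set aside for manual rewriting rather than converted automatically.

```python
import re

# Keep letters, apostrophes (YOU'D), underscores (T_L_A) and spaces;
# everything else is treated as punctuation.
_DISALLOWED = re.compile(r"[^A-Z'_ ]+")


def normalize_line(line):
    """Uppercase a sentence and replace punctuation with single spaces."""
    text = _DISALLOWED.sub(" ", line.upper())
    return " ".join(text.split())


def build_corpus(raw_text):
    """Return (clean, flagged) lists built from multi-line raw text.

    clean holds normalized sentences, one per entry. flagged holds
    lines with digits or non-ASCII characters, which need to be
    spelled out or rewritten by hand.
    """
    clean, flagged = [], []
    for raw in raw_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        if re.search(r"\d", raw) or not raw.isascii():
            flagged.append(raw)
            continue
        line = normalize_line(raw)
        if line:
            clean.append(line)
    return clean, flagged
```

Writing "\n".join(clean) to a text file gives a corpus with one sentence per line; the flagged lines can be fixed by hand and added afterwards.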
Use the lmtool!
[26 January 2010]
NOTE: If you have automated the use of this tool, you will need to update your code. The main difference is that the name of the target script has changed. The old script will remain available, so nothing will break immediately, but it is unlikely to be maintained. Also, file links are no longer tagged in the HTML. If you rely on those tags, please let me know and I'll find a fix.