The lmtool builds a consistent set of lexical and language model files for decoders. The target decoders are the Sphinx family, though any system that can read ARPA-format files can use them.
Currently lmtool is configured for English (American English in particular). If you upload a corpus in a different language, the output is likely to be unpredictable. We are working on this. The current version does not deal gracefully with Unicode; this is also being worked on.
FIRST, CREATE A CORPUS FILE consisting of all sentences you would like the decoder to recognize. The sentences should be one to a line and should not contain punctuation symbols. You may not need to exhaustively list all possible sentences: the decoder will allow fragments to recombine into new sentences, but the sentences you provide will be preferred. For example:
THIS IS AN EXAMPLE SENTENCE
EACH LINE IS SOMETHING THAT YOU'D WANT YOUR SYSTEM TO RECOGNIZE
ACRONYMS PRONOUNCED AS LETTERS ARE BEST ENTERED AS A T_L_A
NUMBERS AND ABBREVIATIONS OUGHT TO BE SPELLED OUT FOR EXAMPLE TWO HUNDRED SIXTY THREE ET CETERA
YOU CAN UPLOAD A FEW THOUSAND SENTENCES BUT THERE IS A LIMIT
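If your sentences start out as ordinary text, the corpus rules above can be applied with a small preprocessing script before uploading. The following is only a sketch, not part of lmtool: the helper names `normalize_line` and `build_corpus` are hypothetical. It assumes that upper-casing (as in the example) is acceptable, and that keeping only ASCII letters, apostrophes, and underscores is a reasonable reading of "no punctuation" given the tool's current trouble with Unicode. Lines containing digits are set aside for manual rewriting rather than converted automatically.

```python
import re

# Assumption: anything other than A-Z, apostrophe, underscore, or space
# counts as punctuation and is replaced by a space.
_DISALLOWED = re.compile(r"[^A-Z'_ ]+")


def normalize_line(line: str) -> str:
    """Upper-case one sentence, drop punctuation, collapse whitespace."""
    cleaned = _DISALLOWED.sub(" ", line.upper())
    return " ".join(cleaned.split())


def build_corpus(sentences):
    """Return (corpus_lines, flagged).

    corpus_lines: normalized sentences, one per list entry.
    flagged: original sentences containing digits, which should be
             spelled out by hand (e.g. TWO HUNDRED SIXTY THREE).
    """
    corpus, flagged = [], []
    for sentence in sentences:
        if any(ch.isdigit() for ch in sentence):
            flagged.append(sentence)
            continue
        line = normalize_line(sentence)
        if line:  # skip lines that end up empty
            corpus.append(line)
    return corpus, flagged
```

Writing the corpus out is then a matter of joining `corpus_lines` with newlines into a plain text file, one sentence per line. Anything in `flagged` needs its numbers spelled out by hand first.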
Use the lmtool!