Kylm - The Kyoto Language Modeling Toolkit

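Kylm, the Kyoto Language Modeling Toolkit, is a language modeling toolkit written in Java. Its features, as reflected in the program options documented below, include several n-gram smoothing methods, unknown-word modeling, word classes, and output in ARPA, binary, and WFST formats.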

Download/Install

Source code and JAR files for the toolkit can be found here.

Program Documentation

CountNgrams

CountNgrams is a program to calculate a smoothed n-gram model from a text corpus.

Example: java -cp kylm.jar kylm.main.CountNgrams training.txt model.arpa
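
To make concrete what counting n-grams involves, here is a minimal, self-contained Java sketch. It is not Kylm's implementation or API: the class name NgramCountSketch and its count method are hypothetical, and the sketch assumes whitespace-tokenized sentences padded with a single <s> at the start and a single </s> at the end. It only collects raw counts; no smoothing is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Kylm.
class NgramCountSketch {
    // Count every n-gram of order 1..maxOrder in the given sentences.
    // Assumption: one <s> is prepended and one </s> is appended per sentence.
    static Map<String, Integer> count(List<String> sentences, int maxOrder) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            List<String> tokens = new ArrayList<>();
            tokens.add("<s>");
            for (String tok : sentence.trim().split("\\s+")) {
                if (!tok.isEmpty()) {
                    tokens.add(tok);
                }
            }
            tokens.add("</s>");
            for (int order = 1; order <= maxOrder; order++) {
                for (int i = 0; i + order <= tokens.size(); i++) {
                    // Join the window of `order` tokens into one map key.
                    String ngram = String.join(" ", tokens.subList(i, i + order));
                    counts.merge(ngram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Arrays.asList("a b a b", "b a"), 2);
        counts.forEach((ngram, c) -> System.out.println(ngram + "\t" + c));
    }
}
```

A real toolkit would store counts in a more compact structure than string keys; the map is used here only to keep the sketch short.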

N-gram model options
    -n:         the length of the n-gram context [default: 3]
    -trim:      the trimming for each level of the n-gram (example: 0:1:1)
    -name:      the name of the model
    -smoothuni: whether or not to smooth unigrams

Symbol/Vocabulary options
    -vocab:     the vocabulary file to use
    -startsym:  the symbol to use for sentence starts [default: <s>]
    -termsym:   the terminal symbol for sentences [default: </s>]
    -vocabout:  the vocabulary file to write out to
    -ukcutoff:  the cut-off for unknown words [default: 0]
    -uksym:     the symbol to use for unknown words [default: <unk>]
    -ukexpand:  expand unknown symbols in the vocabulary
    -ukmodel:   model unknown words. Arguments are processed first to last, 
                so the most general model should be specified last. 
                Format: "symbol:vocabsize[:regex(.*)][:order(2)][:smoothing(wb)]"
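
As a rough illustration of an unknown-word cut-off, the Java sketch below replaces rare words with an unknown symbol before counting. The exact rule Kylm applies is not stated here, so the sketch assumes that a word is replaced when its corpus frequency is at or below the cut-off, which makes a cut-off of 0 a no-op. The class UnknownCutoffSketch and its method are hypothetical names, not Kylm's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Kylm.
class UnknownCutoffSketch {
    // Replace each token whose frequency is <= cutoff with unkSymbol.
    // Assumption: "at or below the cut-off" is the replacement rule.
    static List<String> applyCutoff(List<String> tokens, int cutoff, String unkSymbol) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(freq.get(t) <= cutoff ? unkSymbol : t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "saw", "the", "dog");
        System.out.println(String.join(" ", applyCutoff(tokens, 1, "<unk>")));
    }
}
```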

Class options
    -classes:   a file containing word class definitions 
                ("class word [prob]", one per line)

Smoothing options [default: kn]
    -ml:        maximum likelihood estimation (no smoothing)
    -gt:        Good-Turing smoothing (Katz Backoff)
    -wb:        Witten-Bell smoothing
    -abs:       absolute discounting
    -kn:        Kneser-Ney smoothing (default)
    -mkn:       Modified Kneser-Ney smoothing (of Chen & Goodman)
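
To show what one of these smoothing methods computes, here is a minimal Java sketch of Witten-Bell smoothing for bigrams in its interpolated form: P(w|h) = (c(h,w) + T(h) * P(w)) / (c(h) + T(h)), where T(h) is the number of distinct words seen after h and P(w) is the maximum likelihood unigram probability. This is a textbook formulation written for illustration; whether Kylm uses the interpolated or the backoff form, and how it treats unigrams, is not specified here. The class WittenBellSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Kylm.
class WittenBellSketch {
    private final Map<String, Integer> unigram = new HashMap<>();
    private final Map<String, Map<String, Integer>> followers = new HashMap<>();
    private final Map<String, Integer> contextTotal = new HashMap<>();
    private int totalTokens = 0;

    WittenBellSketch(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            unigram.merge(tokens.get(i), 1, Integer::sum);
            totalTokens++;
            if (i + 1 < tokens.size()) {
                // Record the bigram (tokens[i], tokens[i+1]).
                followers.computeIfAbsent(tokens.get(i), k -> new HashMap<>())
                         .merge(tokens.get(i + 1), 1, Integer::sum);
                contextTotal.merge(tokens.get(i), 1, Integer::sum);
            }
        }
    }

    // Maximum likelihood unigram probability (unsmoothed).
    double unigramProb(String w) {
        return unigram.getOrDefault(w, 0) / (double) totalTokens;
    }

    // Interpolated Witten-Bell bigram probability P(w | h).
    double bigramProb(String h, String w) {
        Map<String, Integer> next = followers.get(h);
        if (next == null) {
            return unigramProb(w); // unseen context: use the unigram estimate
        }
        int c = contextTotal.get(h);   // c(h): bigram tokens starting with h
        int types = next.size();       // T(h): distinct words following h
        return (next.getOrDefault(w, 0) + types * unigramProb(w)) / (c + types);
    }

    public static void main(String[] args) {
        WittenBellSketch m = new WittenBellSketch(Arrays.asList("a", "b", "a", "c", "a", "b"));
        System.out.println("P(b|a) = " + m.bigramProb("a", "b"));
        System.out.println("P(a|a) = " + m.bigramProb("a", "a"));
    }
}
```

Note that the unseen bigram "a a" still receives probability mass, which is the point of smoothing; with -ml it would get zero.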

Output options [default: arpa]
    -bin:       output in binary format
    -wfst:      output in weighted finite state transducer format (WFST)
    -arpa:      output in ARPA format
    -neginf:    the number to print for non-existent backoffs [default: null, example: -99]

Miscellaneous options
    -debug:     the level of debugging information to print [default: 0]

CrossEntropy

CrossEntropy is a program to calculate the cross-entropy of a test corpus using one or more language models.

Usage: java -cp kylm.jar kylm.main.CrossEntropy [OPTIONS] test.txt
Example: java -cp kylm.jar kylm.main.CrossEntropy -arpa model1.arpa:model2.arpa test.txt
    -arpa:  models in ARPA format (model1.arpa:model2.arpa)
    -bin:   models in binary format (model3.bin:model4.bin)
    -debug: the level of debugging information to print [default: 0]
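
For reference, cross-entropy over a test corpus is commonly defined as H = -(1/N) * sum of log2 P(w_i | context), with perplexity equal to 2^H. The Java sketch below computes both from a list of per-word model probabilities. It assumes base-2 logarithms and per-word normalization; the units Kylm actually reports are not stated here, and the class CrossEntropySketch is a hypothetical name rather than Kylm's API.

```java
// Hypothetical illustration only; not part of Kylm.
class CrossEntropySketch {
    // Cross-entropy in bits per word, given the model probability of each word.
    static double crossEntropy(double[] wordProbs) {
        if (wordProbs.length == 0) {
            throw new IllegalArgumentException("at least one probability is required");
        }
        double sumLog2 = 0.0;
        for (double p : wordProbs) {
            sumLog2 += Math.log(p) / Math.log(2.0); // log base 2
        }
        return -sumLog2 / wordProbs.length;
    }

    // Perplexity is 2 raised to the cross-entropy.
    static double perplexity(double[] wordProbs) {
        return Math.pow(2.0, crossEntropy(wordProbs));
    }

    public static void main(String[] args) {
        double[] probs = {0.5, 0.25, 0.125, 0.125};
        System.out.println("cross-entropy = " + crossEntropy(probs));
        System.out.println("perplexity    = " + perplexity(probs));
    }
}
```

A word with probability zero yields infinite cross-entropy, which is one reason a smoothed model is needed for evaluation.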

Kylm API

Development Information

Developers

Additional developers are welcome. If you are interested, please send an email to kylm@.

Kylm is released under the GNU Lesser General Public License.

Revision History

Planned Future Features:

Version 0.0.6 (5/21/2009)

Version 0.0.5 (11/25/2009)

Version 0.0.4 (11/13/2009)

Version 0.0.3 (6/22/2009)

Version 0.0.2 (5/28/2009)

Version 0.0.1 (Initial Alpha Release)