This page lists the software that I have made, most of which has been used in my research.
Kyoto Text Analysis Toolkit (KyTea)
KyTea is a toolkit for text analysis of languages that require word segmentation, such as Japanese and Chinese. We provide Japanese and Chinese models for word segmentation, pronunciation estimation, and POS tagging (Japanese only), and KyTea can be trained to perform other tasks if you have data.
The most interesting technical point of KyTea is that it can be trained from partially annotated data, which means that you only have to annotate the important or difficult parts of sentences, rather than whole sentences as traditional methods require. Our research has confirmed that this allows for both competitive accuracy and efficient annotation on tasks such as pronunciation estimation, word segmentation (Japanese), and POS annotation (Japanese).
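To make the idea of partial annotation concrete, here is a minimal Python sketch of how training examples could be collected only from the character boundaries that an annotator actually marked. The notation ("|" for a word boundary, "-" for no boundary, a space for "not annotated"), the function name boundary_examples, and the simple character-window features are assumptions made for illustration; they are not taken from KyTea's code and may differ from what KyTea actually does.

```python
def boundary_examples(annotated, window=2):
    """Collect one training example per *annotated* character boundary.

    `annotated` alternates characters and boundary marks (hypothetical
    notation): "|" = word boundary, "-" = no boundary, " " = unannotated.
    """
    chars = annotated[0::2]   # the characters themselves
    marks = annotated[1::2]   # the mark between each adjacent pair
    examples = []
    for i, mark in enumerate(marks):
        if mark == " ":
            continue  # unannotated boundary: contributes no training example
        left = chars[max(0, i + 1 - window): i + 1]   # characters before the boundary
        right = chars[i + 1: i + 1 + window]          # characters after the boundary
        examples.append((left, right, mark == "|"))
    return examples


# Only the first three boundaries are annotated; the last two are skipped.
examples = boundary_examples("t-h-e|c a t")
```

Because each boundary is treated as an independent decision, the unannotated boundaries can simply be left out of the training set, which is what makes annotating only part of a sentence useful.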
pialign
pialign is a phrase aligner for phrase-based statistical machine translation that can be used with the Moses decoder. It is based on Inversion Transduction Grammars, with a unique hierarchical model that makes it possible to learn phrase tables using a fully statistical model (no heuristics). In our experiments we found that this compared favorably to the traditional approach, learning more compact models without a loss in accuracy.
latticelm
latticelm is a tool for unsupervised word segmentation using the Bayesian Pitman-Yor language model. It is essentially an implementation of Mochihashi et al.'s word segmentation method, extended so that the model can be learned over lattices. Using this tool, we found that it was possible to learn language models and word segmentations from continuous speech, without using any text.
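For readers unfamiliar with Pitman-Yor language models, the following Python sketch shows the standard predictive probability of a word under a single Pitman-Yor "restaurant" (Chinese restaurant process view). It is only an illustration of the general formula, not latticelm's implementation; the function name, the dictionary-based counts, and the default discount and strength values are all assumptions.

```python
def pitman_yor_prob(word, counts, tables, base_prob, discount=0.5, strength=1.0):
    """Predictive probability of `word` given the current seating arrangement.

    counts[w]: number of customers (tokens) of word w
    tables[w]: number of tables serving w (at most counts[w])
    base_prob: function giving the base (back-off) probability of a word
    """
    total_customers = sum(counts.values())
    total_tables = sum(tables.values())
    c_w = counts.get(word, 0)
    t_w = tables.get(word, 0)
    denom = strength + total_customers
    seated = (c_w - discount * t_w) / denom             # mass from existing tables
    backoff = (strength + discount * total_tables) / denom  # mass sent to the base
    return seated + backoff * base_prob(word)


# Toy example with a uniform base distribution over four word types.
vocab = ["a", "b", "c", "d"]
counts = {"a": 3, "b": 1}
tables = {"a": 2, "b": 1}
uniform = lambda w: 1.0 / len(vocab)
probs = {w: pitman_yor_prob(w, counts, tables, uniform) for w in vocab}
```

In a hierarchical model, base_prob would itself be the probability from a shorter-context restaurant, ending in a character-level model that assigns probability to previously unseen words.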
Kyoto FST Decoder (Kyfd)
Kyfd is a tool for decoding weighted finite state transducer models for text processing. It is highly configurable, and can be used with just about any type of model, while taking simple text input that does not require manual construction of FST models. It has been used in my research for speaking style transformation (InterSpeech09, ICASSP10) and OCR error correction (Japanese), and by others for applications such as paraphrasing.
Kyoto Language Modeling Toolkit (Kylm)
Kylm is a simple language modeling toolkit written entirely in Java, implementing n-gram language models with a number of smoothing methods. It is able to create and evaluate character-based models for unknown words automatically. It is also able to export models directly to WFST format for use with Kyfd, OpenFst, or other WFST-based systems.
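As a rough illustration of what a smoothed n-gram model computes, here is a small Python sketch of a bigram model that interpolates the maximum-likelihood bigram estimate with an add-one-smoothed unigram estimate. This is not Kylm's code or API (Kylm is written in Java), and the smoothing scheme, function names, and interpolation weight are assumptions chosen for brevity rather than any of the methods Kylm implements.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Interpolate the bigram estimate with a smoothed unigram estimate."""
    vocab_size = len(unigrams) + 1  # one extra slot reserved for unknown words
    total = sum(unigrams.values())
    p_uni = (unigrams[word] + 1) / (total + vocab_size)   # add-one smoothing
    p_ml = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_ml + (1 - lam) * p_uni


unigrams, bigrams = train_bigram(["a b", "a c"])
p_seen = bigram_prob("a", "b", unigrams, bigrams)      # observed bigram
p_unseen = bigram_prob("a", "zzz", unigrams, bigrams)  # unknown word, still > 0
```

Here all unknown words share a single reserved probability slot. A character-based unknown-word model, as mentioned above, would instead spread that mass over unseen words according to their spelling.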
Other Programs and Scripts
- prontron: A program that does pronunciation estimation (mainly in Japanese) using the structured perceptron.
- dirichlet-topic.pl: A simple script that allows you to find representative words for a specific topic (using a model based on Dirichlet processes).
- Vocabtron: A game to test your vocabulary in English, French, or Japanese.