This page lists the software that I have made, most of which has been used in my research.
Travatar is a toolkit for tree-to-string translation, which is able to achieve high accuracy when translating between languages that require large amounts of reordering. My demo paper at ACL 2013 describes it in more detail, including experiments comparing the method with other translation frameworks.
lamtram is a toolkit for language modeling or translation modeling using neural networks. We have used it in a number of projects, including our winning submission to the 2015 Workshop on Asian Translation.
lader is a tool to help solve long-distance reordering in machine translation, which also functions as an unsupervised discriminative parser. In our paper at EMNLP 2012 we found that this allowed for gains in accuracy over phrase-based and hierarchical-phrase-based translation.
pialign is a phrase aligner for phrase-based statistical machine translation that can be used with the Moses decoder. It is based on Inversion Transduction Grammars, with a unique hierarchical model that makes it possible to learn phrase tables using a fully statistical model (no heuristics). In our paper at ACL 2011 we found that this compared favorably to the traditional approach, learning more compact models without a loss in accuracy. pialign also has the advantage that it can align strings of characters, not just strings of words, as shown in our ACL 2012 paper where we achieve competitive translation results using substring-to-substring translation.
latticelm is a tool for unsupervised word segmentation using the Bayesian Pitman-Yor language model. It is essentially an implementation of Mochihashi et al.'s word segmentation method that can be learned over lattices. Using this tool, we found that it was possible to learn language models and word segmentations from continuous speech (InterSpeech 2010), without using any text.
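As a rough illustration of the Pitman-Yor language model mentioned above, the following Python sketch computes the standard predictive probability of a word under a single Pitman-Yor process in its Chinese restaurant representation. It is not latticelm's code: the function name `pitman_yor_prob`, the dictionary-based counts, and the default hyperparameters are all assumptions made for this example, and a real segmenter would add hierarchical contexts and sampling on top of this formula.

```python
def pitman_yor_prob(word, customers, tables, base_prob,
                    discount=0.5, strength=1.0):
    """Predictive probability of `word` under a Pitman-Yor process.

    customers: dict mapping word -> number of customers (tokens) for that word
    tables:    dict mapping word -> number of tables serving that word
    base_prob: function mapping word -> probability under the base distribution
    Assumes 0 <= discount < 1 and strength > -discount.
    """
    c_total = sum(customers.values())
    t_total = sum(tables.values())
    c_w = customers.get(word, 0)
    t_w = tables.get(word, 0)
    # Mass from the discounted counts of tokens already seen for this word.
    seen = (c_w - discount * t_w) / (strength + c_total)
    # Mass reserved for the base distribution; it grows with the table count.
    backoff = (strength + discount * t_total) / (strength + c_total)
    return seen + backoff * base_prob(word)


# Toy counts (hypothetical) with a uniform base distribution over three words.
vocab = ["a", "b", "c"]
customers = {"a": 3, "b": 1}
tables = {"a": 2, "b": 1}
uniform = lambda w: 1.0 / len(vocab)

for w in vocab:
    print(w, pitman_yor_prob(w, customers, tables, uniform))
```

The discount takes a little probability away from every observed word and hands it to the base distribution, which is what lets the model assign sensible probability to words it has not seen yet.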
Kyoto Text Analysis Toolkit (KyTea)
KyTea is a toolkit for text analysis of languages that require word segmentation, such as Japanese and Chinese. We provide Japanese and Chinese models for word segmentation, pronunciation estimation, and POS tagging (Japanese only), and KyTea can be trained to perform other tasks if you have data. The most interesting technical point of KyTea is that it can be trained from partially annotated data, which means that you only have to annotate the important or difficult parts of sentences, instead of whole sentences as traditional methods require. We have done research confirming that it allows for both competitive accuracy and efficient annotation on tasks such as pronunciation estimation (LREC 2010), word segmentation, and POS annotation (ACL 2011).
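To make the idea of training from partially annotated data more concrete, here is a minimal Python sketch of how a partially annotated sentence could be turned into training examples for word segmentation. It is not KyTea's code or file format: the mark encoding (`|` for an annotated boundary, `-` for an annotated non-boundary, `?` for an unannotated gap), the function name `boundary_examples`, and the character-window features are assumptions for illustration only. Each annotated gap between characters becomes one independent example, and unannotated gaps are skipped.

```python
def boundary_examples(chars, marks, window=2):
    """Turn a partially annotated sentence into per-gap training examples.

    chars: the characters of the sentence.
    marks: one mark per gap between adjacent characters (len(chars) - 1 marks):
           '|' = word boundary, '-' = no boundary, '?' = not annotated.
    Returns a list of (left_context, right_context, label) tuples, where
    label is True for a boundary and False for a non-boundary.
    """
    if len(marks) != len(chars) - 1:
        raise ValueError("need exactly one mark per gap between characters")
    examples = []
    for i, mark in enumerate(marks):
        if mark == "?":
            continue  # unannotated gap: contributes no training example
        left = chars[max(0, i + 1 - window): i + 1]
        right = chars[i + 1: i + 1 + window]
        examples.append((left, right, mark == "|"))
    return examples


# Only the two gaps around "は" are annotated; the rest are left unlabeled.
sentence = "これは本です"
marks = "?||??"
for left, right, label in boundary_examples(sentence, marks):
    print(left, right, label)
```

Because every example depends only on the characters around one gap, an annotator can label just the difficult positions and the remaining gaps simply produce no examples.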
Kyoto FST Decoder (Kyfd)
Kyfd is a tool for decoding weighted finite state transducer models for text processing. It is highly configurable, and can be used with just about any type of model, while taking simple text input that does not require manual construction of FST models. It has been used in my research for speaking style transformation (InterSpeech 2009, ICASSP 2010) and OCR error correction (Japanese), and by others for applications such as paraphrasing.
Kyoto Language Modeling Toolkit (Kylm)
Kylm is a simple language modeling toolkit written entirely in Java, implementing n-gram language models with a number of smoothing methods. It is able to create and evaluate character-based models for unknown words automatically. It is also able to export models directly to WFST format for use with Kyfd, OpenFst, or other WFST-based systems.
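For readers unfamiliar with smoothed n-gram models, the sketch below shows a tiny bigram language model with one simple smoothing scheme: linear interpolation between the bigram estimate and an add-one unigram estimate. It is written in Python for brevity and is not Kylm's code, which is in Java. The class name `BigramLM`, the interpolation weight `lam`, and the choice of smoothing are assumptions for this example and are not necessarily among the methods Kylm implements.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram model: interpolated bigram and add-one unigram estimates."""

    def __init__(self, sentences, lam=0.8):
        self.lam = lam
        self.unigrams = Counter()   # counts of each predicted word
        self.bigrams = Counter()    # counts of (previous word, word) pairs
        self.contexts = Counter()   # counts of each previous word
        for words in sentences:
            tokens = ["<s>"] + list(words) + ["</s>"]
            for prev, cur in zip(tokens, tokens[1:]):
                self.unigrams[cur] += 1
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        self.total = sum(self.unigrams.values())
        # One extra vocabulary slot is reserved for unknown words.
        self.vocab_size = len(self.unigrams) + 1

    def unigram_prob(self, word):
        # Add-one smoothing gives unseen words a small nonzero probability.
        return (self.unigrams[word] + 1) / (self.total + self.vocab_size)

    def prob(self, prev, word):
        if self.contexts[prev] == 0:
            return self.unigram_prob(word)  # unseen context: unigram only
        ml = self.bigrams[(prev, word)] / self.contexts[prev]
        return self.lam * ml + (1 - self.lam) * self.unigram_prob(word)

    def log_prob(self, words):
        # Natural-log probability of a whole sentence, including "</s>".
        tokens = ["<s>"] + list(words) + ["</s>"]
        return sum(math.log(self.prob(p, c)) for p, c in zip(tokens, tokens[1:]))


lm = BigramLM([["a", "b"], ["a", "c"]])
print(lm.prob("a", "b"))
print(lm.log_prob(["a", "b"]))
```

Interpolating with the unigram estimate keeps every probability nonzero, so the model can still score sentences containing word pairs that never appeared in training.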
Other Programs and Scripts
- tmert.py: A program for thresholded minimum error rate training for question answering systems.
- prontron: A program that does pronunciation estimation (mainly in Japanese) using the structured perceptron.
- dirichlet-topic.pl: A simple script that allows you to find representative words for a specific topic (using a model based on Dirichlet processes).