This is the home of the Kyoto Text Analysis Toolkit (KyTea, pronounced "cutie"). It is a general toolkit developed for analyzing text, with a focus on Japanese and other languages requiring word or morpheme segmentation.
KyTea is able to perform the following types of processing: word segmentation (KyWs) and pronunciation estimation (KyPe).
Both KyWs and KyPe use a pointwise classifier-based (SVM or logistic regression) approach, allowing for training on partially annotated training data. The classifiers are trained with LIBLINEAR. More details on the approach will be posted in the near future.
The most recent version of KyTea is version 0.0.3, which can be downloaded here: KyTea 0.0.3.
KyTea has currently only been tested on Linux (versions for Windows and possibly Mac are planned). If you are running Linux, download the most recent version, decompress it, and build and install the program:
  tar -xzf kytea-X.X.X.tar.gz
  cd kytea-X.X.X
  ./configure
  make
  make install
  kytea --help
If this prints a help message, KyTea is working properly.
Building a basic word segmentation and pronunciation estimation model with KyTea is simple. First, you must prepare a corpus with one sentence per line in the following format (if you only want to do word segmentation, the pronunciations are not necessary):
word1/pron1 word2/pron2 word3/pron3 word4/pron4 word5/pron5 word6/pron6
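Before training, it can help to confirm that every token in the corpus really has the word/pron form. The following is a minimal sketch, not part of KyTea: check_corpus is a hypothetical helper name, and it assumes tokens are separated by spaces and that neither words nor pronunciations contain a slash. It applies only to corpora that include pronunciations.

```shell
# check_corpus FILE
# Hypothetical helper (not part of KyTea). Prints "line N: token" for every
# token that is not of the form word/pron. Assumes space-separated tokens and
# no '/' inside a word or a pronunciation.
check_corpus() {
    awk '{
        for (i = 1; i <= NF; i++) {
            n = split($i, parts, "/")
            # A valid token splits into exactly two non-empty parts.
            if (n != 2 || parts[1] == "" || parts[2] == "")
                printf "line %d: %s\n", NR, $i
        }
    }' "$1"
}
```

Running check_corpus on the training file prints nothing if every token is well formed under these assumptions.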
Let's say that this corpus is named train.full (full means that the file is fully annotated in the above format). If we have an unsegmented file named test.raw, we can create a model and analyze the unsegmented file using the following commands.
  train-kytea -full train.full -model model.dat
  kytea -model model.dat < test.raw > test.full
test.full will now contain the segmented text, with each word annotated with a pronunciation.
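If a later step needs only the segmented words from a file that has already been analyzed, the pronunciations can be stripped afterwards. This is a sketch under the assumption that the output uses the word/pron format shown above and that words contain no slash; strip_prons is a hypothetical helper name, not a KyTea command.

```shell
# strip_prons
# Hypothetical helper (not part of KyTea). Reads word/pron tokens on stdin and
# writes only the words, by deleting everything from '/' up to the next space.
strip_prons() {
    sed 's#/[^ ]*##g'
}

# Usage (test.words is a placeholder file name):
#   strip_prons < test.full > test.words
```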
kytea performs word segmentation and pronunciation estimation given a model.
Options:
  -model   The model file to use when analyzing text
  -in      The formatting of the input (raw/full/part/prob, default raw)
  -out     The formatting of the output (full/part/prob, default full)
  -nows    Don't do word segmentation (raw input cannot be accepted)
  -nope    Don't do pronunciation estimation (full input cannot be accepted)
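As an illustration of how these options combine, the wrapper below runs word segmentation only by passing -nope. It is a sketch that uses only the flags listed above; the function name segment_only and the file names in the usage comment are placeholders.

```shell
# segment_only MODEL
# Segments raw text from stdin with the given model, skipping pronunciation
# estimation (-nope). Input is read in the default raw format.
segment_only() {
    kytea -model "$1" -nope
}

# Usage (file names are placeholders):
#   segment_only model.dat < test.raw > test.ws
```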
train-kytea is a program to train models for KyTea.
Input/Output Options:
  -encode   The text encoding to be used for input/output (utf8/euc/sjis; default: utf8)
  -full     A file of fully annotated training data (can be specified multiple times)
  -part     A file of partially annotated training data (can be specified multiple times)
  -prob     A file of training data annotated with confidences (can be specified multiple times)
  -dict     A dictionary file (in the form of one 'word/pron' entry per line)
  -model    The file to write the trained model to
  -modtext  Print a text model (instead of the default binary)
Model Training Options (basic):
  -nows     Don't train a word segmentation model
  -nope     Don't train a pronunciation estimation model
Model Training Options (for advanced users):
  -charw    The window of characters to use on either side of a boundary for WS (default 2)
  -charn    The maximum length of character n-grams to use for WS (default 3)
  -typew    The window of character types to use on either side of a boundary for WS (default 3)
  -typen    The maximum length of character type n-grams to use for WS (default 3)
  -dicn     All dictionary words greater than this length will be bucketed together (default 4)
  -eps      The epsilon stopping criterion for classifier training
  -bias     The bias value to use in classifier training (default 1)
  -solver   The solver (0 = logistic regression, 1 = SVM, etc.; default 1)
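For example, a training run that adds a dictionary and switches from the default SVM to logistic regression might look like the sketch below. It uses only the options listed above; the function name train_with_dict and the file names in the usage comment are placeholders.

```shell
# train_with_dict CORPUS DICT MODEL
# Trains on a fully annotated corpus plus a dictionary file, using logistic
# regression (-solver 0) rather than the default SVM (-solver 1).
train_with_dict() {
    train-kytea -full "$1" -dict "$2" -solver 0 -model "$3"
}

# Usage (file names are placeholders):
#   train_with_dict train.full train.dict model.dat
```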
KyTea is currently developed by Graham Neubig, but any additional developers are welcome. If you are interested, please send an email to kytea@
.