KyTea

This is the home of the Kyoto Text Analysis Toolkit (KyTea, pronounced "cutie"). It is a general toolkit developed for analyzing text, with a focus on Japanese and other languages requiring word or morpheme segmentation.

Features

KyTea is able to perform two types of processing: word segmentation (KyWs) and pronunciation estimation (KyPe).

Both KyWs and KyPe use a pointwise, classifier-based approach (SVM or logistic regression), which allows training on partially annotated data. The classifiers are trained with LIBLINEAR. More details on the approach will be posted in the near future.
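
To make the pointwise idea concrete, here is a minimal sketch (not KyTea's actual feature extraction) that turns one segmented sentence into independent per-boundary examples. Each gap between two characters gets a label of 1 (word boundary) or 0 (no boundary), together with a window of surrounding characters. The function name boundary_examples and the output layout are made up for illustration, and the awk code assumes single-byte (ASCII) characters.

```shell
# Hypothetical helper: read space-separated words on stdin and print one
# line per character gap in the form "left|right<TAB>label".
# $1 is the character window on each side (default 2).
boundary_examples() {
    awk -v w="${1:-2}" '
        {
            text = ""; labels = ""
            for (i = 1; i <= NF; i++) {
                n = length($i)
                text = text $i
                # One "0" for every gap inside the word ...
                for (j = 1; j < n; j++) labels = labels "0"
                # ... and a "1" for the gap that follows the word.
                if (i < NF) labels = labels "1"
            }
            for (b = 1; b < length(text); b++) {
                start = b - w + 1
                if (start < 1) start = 1
                left = substr(text, start, b - start + 1)
                right = substr(text, b + 1, w)
                print left "|" right "\t" substr(labels, b, 1)
            }
        }
    '
}
```

Because every gap is a separate example, a partially annotated sentence can contribute examples only for the gaps that are actually annotated, which is what makes training on partial annotation possible.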

Download/Install

Download

The most recent version of KyTea is version 0.0.3, which can be downloaded here: KyTea 0.0.3.

Install

KyTea has so far been tested only on Linux (versions for Windows and possibly Mac are planned). If you are running Linux, download the most recent version, decompress it, then build and install the program:

tar -xzf kytea-X.X.X.tar.gz
cd kytea-X.X.X
./configure
make
make install
kytea --help

If this prints a help message, KyTea is working properly.

Program Documentation

Sample

Building a basic word segmentation and pronunciation estimation model with KyTea is simple. First, you must prepare a corpus with one sentence per line in the following format (if you only want to do word segmentation, the pronunciations are not necessary):

word1/pron1 word2/pron2 word3/pron3
word4/pron4 word5/pron5 word6/pron6
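
Before training, it can help to check that a corpus really follows this layout. The sketch below is one possible check, assuming every token must be a non-empty word, a single slash, and a non-empty pronunciation (so it applies only to corpora that include pronunciations). The function name check_full_format is hypothetical.

```shell
# Hypothetical helper: read a fully annotated corpus on stdin, report every
# token that is not of the form word/pron, and return non-zero if any is found.
check_full_format() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # Assumption: exactly one slash, with text on both sides.
                if ($i !~ "^[^/]+/[^/]+$") {
                    printf "line %d: bad token \"%s\"\n", NR, $i
                    bad = 1
                }
            }
        }
        END { exit bad + 0 }
    '
}
```

For example, running check_full_format < train.full prints nothing and returns 0 when every token is well formed.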

Let's say that this corpus is named train.full (full means that the file is fully annotated in the above format). If we have an unsegmented file named test.raw, we can create a model and analyze the unsegmented file using the following commands.

train-kytea -full train.full -model model.dat
kytea -model model.dat < test.raw > test.full

test.full will now contain the segmented text, with each word annotated with a pronunciation.
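
If no separate unsegmented file is at hand, one way to try the pipeline is to derive raw text from annotated lines that were held out of training. The following is a minimal sketch, assuming that spaces are the only word separator and that everything after a slash in a token is the pronunciation. The function name full_to_raw is hypothetical.

```shell
# Hypothetical helper: convert "word/pron word/pron" lines on stdin into
# unsegmented text by dropping the pronunciations and the spaces.
full_to_raw() {
    sed -e 's#/[^ ]*##g' -e 's/ //g'
}
```

For example, full_to_raw < heldout.full > test.raw (heldout.full being a made-up file name) produces input suitable for the kytea command above, and the original annotated lines can then serve as a reference for the output.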

Usage

kytea

kytea performs word segmentation and pronunciation estimation given a model.

Options: 
  -model   The model file to use when analyzing text
  -in      The formatting of the input  (raw/full/part/prob, default raw)
  -out     The formatting of the output (full/part/prob, default full)
  -nows    Don't do word segmentation (raw input cannot be accepted)
  -nope    Don't do pronunciation estimation (full input cannot be accepted)
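
As an illustration of how these options might be combined, the sketch below wraps the two single-task modes in small shell functions. It uses only the flags listed above, but the exact combinations are an assumption based on the option descriptions rather than tested invocations, and the function names segment_only and pronounce_only are hypothetical.

```shell
# Hypothetical wrapper: word segmentation only.
# Reads raw text on stdin; $1 is the model file.
segment_only() {
    kytea -model "$1" -nope
}

# Hypothetical wrapper: pronunciation estimation only.
# Assumes stdin is already segmented, so the input format is set to "full"
# (raw input is not accepted together with -nows).
pronounce_only() {
    kytea -model "$1" -in full -nows
}
```

For example, segment_only model.dat < test.raw > test.ws would write the segmented text to test.ws (a made-up file name).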

train-kytea

train-kytea is a program to train models for KyTea.

Input/Output Options: 
  -encode  The text encoding to be used for input/output (utf8/euc/sjis; default: utf8)
  -full    A file of fully annotated training data (can be specified multiple times)
  -part    A file of partially annotated training data (can be specified multiple times)
  -prob    A file of training data annotated with confidences (can be specified multiple times)
  -dict    A dictionary file (in the form of one 'word/pron' entry per line)
  -model   The file to write the trained model to
  -modtext Print a text model (instead of the default binary)
Model Training Options (basic): 
  -nows    Don't train a word segmentation model
  -nope    Don't train a pronunciation estimation model
Model Training Options (for advanced users): 
  -charw   The window of characters to use on either side of a boundary for WS (default 2)
  -charn   The maximum length of character n-grams to use for WS (default 3)
  -typew   The window of character types to use on either side of a boundary for WS (default 3)
  -typen   The maximum length of character type n-grams to use for WS (default 3)
  -dicn    All dictionary words greater than this length will be bucketed together (default 4)
  -eps     The epsilon stopping criterion for classifier training
  -bias    The bias value to use in classifier training (default 1)
  -solver  The solver (0 = logistic regression, 1 = SVM, etc.; default 1)
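
Since -full can be specified multiple times, training on several annotated files means repeating the flag once per file. The sketch below builds such a command line from a list of corpus files. The function name train_from_full and the TRAIN_KYTEA override variable are hypothetical conveniences, not part of KyTea.

```shell
# Hypothetical wrapper: train_from_full MODEL CORPUS...
# Runs train-kytea with one -full flag per corpus file.
train_from_full() {
    model=$1
    shift
    # Rewrite the argument list from "a b" into "-full a -full b".
    for corpus in "$@"; do
        set -- "$@" -full "$corpus"
        shift
    done
    # TRAIN_KYTEA is an assumed override for the binary path (useful for dry runs).
    "${TRAIN_KYTEA:-train-kytea}" "$@" -model "$model"
}
```

For example, train_from_full model.dat news.full web.full (both corpus names made up) runs train-kytea -full news.full -full web.full -model model.dat. Setting TRAIN_KYTEA=echo prints the arguments instead of training.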

Development Information

KyTea is currently developed by Graham Neubig, but any additional developers are welcome. If you are interested, please send an email to kytea@.

Revision History

Future Features

Version 0.0.3 (11/30/2009)

Version 0.0.2 (11/16/2009)

Version 0.0.1 (11/05/2009)