This is the home of the Kyoto Text Analysis Toolkit (KyTea, pronounced "cutie"). It is a general toolkit developed for analyzing text, with a focus on Japanese, Chinese and other languages requiring word or morpheme segmentation.
KyTea is able to perform word segmentation, part-of-speech (POS) tagging, and pronunciation estimation.
All functionality is performed using a pointwise classifier-based (SVM or logistic regression) approach, which allows training on partially annotated data. The classifiers are trained with LIBLINEAR. More details on KyTea's classification approach can be found here or in the accompanying papers (please cite the ACL paper if you are talking about POS or other tagging, or the LREC paper if you are talking about pronunciation estimation).
Latest Version: KyTea 0.4.6 (Source) KyTea 0.4.2 (Windows Binary)
This software package contains the source code and a default model. The default model uses the UTF-8 character encoding and estimates POS tags as well as pronunciations according to keyboard input (which differ slightly from the actual phonetic pronunciations). More details and a number of other models can be found on the KyTea Models page.
Bleeding-Edge Code: @github
Past Versions: 0.4.5 (Source) 0.4.4 (Source) 0.4.3 (Source) 0.4.2 (Source Win) 0.4.1 (Source Win) 0.4.0 0.3.2 0.3.1 0.3.0 0.2.1 0.2.0 0.1.3 0.1.2 0.1.1 0.1.0 0.0.3 0.0.2 0.0.1
The code of KyTea is distributed according to the Apache License Version 2, and can be distributed freely according to this license. The models included with KyTea or distributed on the KyTea models page may be used for research or commercial purposes (except where noted otherwise), but may not be re-distributed without prior permission.
KyTea has been tested on Linux, Mac OSX, and Windows (via Cygwin). On Linux or Cygwin, download the source code, and install using the following commands.
    tar -xzf kytea-X.X.X.tar.gz
    cd kytea-X.X.X
    ./configure
    make
    make install
    kytea --help
If this prints a help message, KyTea is working properly. There are a number of options that can be set during compile-time to adjust the install location or program efficiency.
After you have installed KyTea, run the program to split the text into words and annotate each word with a POS and pronunciation. If test.raw is a file that contains raw text, the following command will create annotated text in the file test.full.
kytea < test.raw > test.full
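The same call can be scripted. The following is a minimal Python sketch, assuming the kytea binary is installed and on the PATH. The helper names build_kytea_command and analyze are hypothetical and not part of KyTea; the only flags used are -model and -out, both taken from the option list on this page.

```python
import shutil
import subprocess


def build_kytea_command(model=None, out_format=None):
    """Build the argument list for a kytea call (hypothetical helper)."""
    command = ["kytea"]
    if model is not None:
        command += ["-model", model]      # -model: model file to use
    if out_format is not None:
        command += ["-out", out_format]   # -out: output format, e.g. "full"
    return command


def analyze(text, model=None, out_format=None):
    """Send raw text to kytea on stdin and return what it prints on stdout."""
    if shutil.which("kytea") is None:
        raise RuntimeError("kytea was not found on PATH")
    result = subprocess.run(
        build_kytea_command(model, out_format),
        input=text,
        capture_output=True,
        encoding="utf-8",   # assumes a UTF-8 model, like the default one
        check=True,         # raise if kytea exits with an error
    )
    return result.stdout
```

Calling analyze(open("test.raw", encoding="utf-8").read()) should then be equivalent to the shell redirection above.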
While KyTea comes with a default model, if you have your own annotated text it is both simple and useful to build your own model (more detail on training KyTea can be found here). First, you must prepare a corpus with one sentence per line, in which each word is followed by one or more tags:
    word1/pos1/pron1 word2/pos2/pron2 word3/pos3/pron3
    word4/pos4/pron4 word5/pos5/pron5 word6/pos6/pron6
Let's say that this corpus is named train.full (full means that the file is fully annotated in the above format). If we have an unsegmented file named test.raw, we can create a model and analyze the unsegmented file using the following commands.
    train-kytea -full train.full -model model.dat
    kytea -model model.dat < test.raw > test.full
test.full will now contain the segmented text, with each word annotated with a POS and pronunciation.
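If the annotated data lives in a program rather than in a file, a corpus such as train.full can be generated from it. The following is a minimal Python sketch of that step; the function names are hypothetical, and the sketch simply rejects fields that contain a separator instead of escaping them, because the escaping rules are not described on this page.

```python
def format_full_line(sentence, word_bound=" ", tag_bound="/"):
    """Join (word, tag, ...) tuples into one fully annotated corpus line."""
    for token in sentence:
        for field in token:
            # Escaping is out of scope for this sketch, so refuse such fields.
            if word_bound in field or tag_bound in field:
                raise ValueError(f"field contains a separator: {field!r}")
    return word_bound.join(tag_bound.join(token) for token in sentence)


def write_full_corpus(sentences, path):
    """Write one sentence per line, as the training corpus format requires."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(format_full_line(sentence) + "\n")


# Placeholder data mirroring the corpus example on this page.
sentences = [
    [("word1", "pos1", "pron1"), ("word2", "pos2", "pron2"), ("word3", "pos3", "pron3")],
    [("word4", "pos4", "pron4"), ("word5", "pos5", "pron5"), ("word6", "pos6", "pron6")],
]
lines = [format_full_line(sentence) for sentence in sentences]
```

write_full_corpus(sentences, "train.full") would then produce a file that can be passed to train-kytea with -full.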
kytea performs word segmentation and tagging
Analysis Options:
    -model      The model file to use when analyzing text
    -nows       Don't do word segmentation (raw input cannot be accepted)
    -notags     Don't do tagging (full input cannot be accepted)
    -notag      Skips a particular tag (-notag 1 will skip the first tag)
    -nounk      Don't estimate the pronunciation of unknown words
    -wsconst    Do not segment some character types (e.g. "D" to not segment digits)
    -unkbeam    The width of the beam to use in beam search for unknown words (default 50, 0 for full search)
Format Options:
    -in         The formatting of the input (raw/full/part/conf/tok, default raw)
    -out        The formatting of the output (full/part/conf/tok/eda/tags, default full)
    -tagmax     The maximum number of tags to print for one word (default 3, 0 implies no limit)
    -deftag     A tag for words that cannot be given any tag (for example, unknown words that contain a character not in the subword dictionary)
    -unktag     A tag to append to indicate words not in the dictionary
Format Options (for advanced users):
    -wordbound  The separator for words in full annotation (" ")
    -tagbound   The separator for tags in full/partial annotation ("/")
    -elembound  The separator for candidates in full/partial annotation ("&")
    -unkbound   Indicates unannotated boundaries in partial annotation (" ")
    -skipbound  Indicates skipped boundaries in partial annotation ("?")
    -nobound    Indicates non-existence of boundaries in partial annotation ("-")
    -hasbound   Indicates existence of boundaries in partial annotation ("|")
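The separators above determine how a line of full output can be read back into a program. The following is a minimal Python sketch using the default values of -wordbound, -tagbound and -elembound. The function name is hypothetical, and the sketch assumes that when several candidates are printed for a tag they appear within one tag field joined by the candidate separator; escaped separators inside words are not handled.

```python
def parse_full_line(line, word_bound=" ", tag_bound="/", elem_bound="&"):
    """Split one line of full annotation into (word, tags) pairs.

    Each entry of tags is the list of candidates for that tag position.
    """
    tokens = []
    for chunk in line.strip().split(word_bound):
        if not chunk:
            continue  # skip empty pieces caused by repeated separators
        word, *tag_fields = chunk.split(tag_bound)
        tags = [field.split(elem_bound) for field in tag_fields]
        tokens.append((word, tags))
    return tokens


parsed = parse_full_line("word1/pos1&pos2/pron1 word2/pos3/pron2")
```

Here parsed[0] holds the word "word1" with two POS candidates and one pronunciation candidate.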
train-kytea is a program to train models for KyTea.
Input/Output Options:
    -encode     The text encoding to be used (utf8/euc/sjis; default: utf8)
    -full       A fully annotated training corpus (can be used multiple times)
    -tok        A tokenized training corpus (can be used multiple times)
    -part       A partially annotated training corpus (can be used multiple times)
    -conf       A confidence annotated training corpus (can be used multiple times)
    -feat       A feature file generated by -featout
    -dict       A dictionary file (one 'word/pron' entry per line, multiple possible)
    -subword    A file of subword units. This will enable unknown word PE.
    -model      The file to write the trained model to
    -modtext    Print a text model (instead of the default binary)
    -featout    Write the features used in training the model to this file
Model Training Options (basic):
    -nows       Don't train a word segmentation model
    -notags     Don't train a tagging model
    -global     Train the nth tag with a global model (good for POS, bad for PE)
    -debug      The debugging level during training (0=silent, 1=normal, 2=detailed)
Model Training Options (for advanced users):
    -charw      The character window to use for WS (3)
    -charn      The character n-gram length to use for WS (3)
    -typew      The character type window to use for WS (3)
    -typen      The character type n-gram length to use for WS (3)
    -dictn      Dictionary words greater than -dictn will be grouped together (4)
    -unkn       Language model n-gram order for unknown words (3)
    -eps        The epsilon stopping criterion for classifier training
    -cost       The cost hyperparameter for classifier training
    -bias       Whether to use a bias value in classifier training (true)
    -solver     The solver (1=SVM, 7=logistic regression, etc.; default 1, see LIBLINEAR documentation for more details)
Format Options (for advanced users):
    -wordbound  The separator for words in full annotation (" ")
    -tagbound   The separator for tags in full/partial annotation ("/")
    -elembound  The separator for candidates in full/partial annotation ("&")
    -unkbound   Indicates unannotated boundaries in partial annotation (" ")
    -skipbound  Indicates skipped boundaries in partial annotation ("?")
    -nobound    Indicates non-existence of boundaries in partial annotation ("-")
    -hasbound   Indicates existence of boundaries in partial annotation ("|")
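The four boundary markers used in partial annotation can be illustrated with a small reader. The following is a rough Python sketch under strong assumptions that this page does not confirm: the line carries no tags, and characters and boundary markers strictly alternate (character, marker, character, and so on). The function name and label strings are hypothetical.

```python
# Default markers, taken from -hasbound, -nobound, -unkbound and -skipbound.
BOUNDARY_LABELS = {
    "|": "boundary",
    "-": "no_boundary",
    " ": "unannotated",
    "?": "skipped",
}


def parse_partial(line):
    """Return (characters, labels) for an untagged partially annotated line.

    labels[i] describes the position between characters[i] and characters[i + 1].
    """
    if len(line) % 2 == 0:
        raise ValueError("expected alternating characters and markers")
    characters = list(line[0::2])
    labels = []
    for marker in line[1::2]:
        if marker not in BOUNDARY_LABELS:
            raise ValueError(f"unknown boundary marker: {marker!r}")
        labels.append(BOUNDARY_LABELS[marker])
    return characters, labels


characters, labels = parse_partial("a-b|c d")
```

In this example, "a" and "b" are marked as belonging to the same word, a word boundary follows "b", and the position between "c" and "d" is left unannotated.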
If you are interested in participating in the KyTea project, please send an email to kytea@.
Last Modified: 2010-01-21 by neubig