KyTea

This is the home of the Kyoto Text Analysis Toolkit (KyTea, pronounced "cutie"). It is a general toolkit developed for analyzing text, with a focus on Japanese, Chinese and other languages requiring word or morpheme segmentation.

Analysis: Method Details, IO Formats, API
Training: Training Models, Extra Models

Domain Adaptation with KyTea
Development

Features

KyTea is able to perform the following types of processing:

Word Segmentation: It can separate an unsegmented text stream into appropriate units (words or morphemes).
Tagging: It can estimate tags for words, such as POS tags and pronunciations. It can also estimate the pronunciation of unknown words.

All processing uses a pointwise, classifier-based approach (SVM or logistic regression), which allows training on partially annotated data. The classifiers are trained with LIBLINEAR. More details on KyTea's classification approach can be found here or in the following papers (please cite the ACL paper if you are discussing POS or other tagging, or the LREC paper if you are discussing pronunciation estimation).

Graham Neubig, Yosuke Nakata, Shinsuke Mori.
Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis
The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Portland, Oregon, USA. June 2011.
Graham Neubig, Shinsuke Mori.
Word-based Partial Annotation for Efficient Corpus Construction
The Seventh International Conference on Language Resources and Evaluation (LREC 2010). Malta. May 2010.

Download/Install

Download

Latest Version: KyTea 0.4.7 (Source) KyTea 0.4.2 (Windows Binary)

This software package contains the source code and a default model. The default model uses the UTF-8 character encoding and estimates POS tags as well as pronunciations as they would be typed on a keyboard (which differ slightly from the actual phonetic pronunciations). More details and a number of other models can be found on the KyTea Models page.

Bleeding-Edge Code: @github
Past Versions: 0.4.6 (Source) 0.4.5 (Source) 0.4.4 (Source) 0.4.3 (Source) 0.4.2 (Source Win) 0.4.1 (Source Win) 0.4.0 0.3.2 0.3.1 0.3.0 0.2.1 0.2.0 0.1.3 0.1.2 0.1.1 0.1.0 0.0.3 0.0.2 0.0.1

The code of KyTea is distributed according to the Apache License Version 2, and can be distributed freely according to this license. The models included with KyTea or distributed on the KyTea models page may be used for research or commercial purposes (except where noted otherwise), but may not be re-distributed without prior permission.

Install

KyTea has been tested on Linux, Mac OS X, and Windows (via Cygwin). On Linux or Cygwin, download the source code and install it using the following commands.

tar -xzf kytea-X.X.X.tar.gz
cd kytea-X.X.X
./configure
make
make install
kytea --help

If this prints a help message, KyTea is working properly. There are a number of options that can be set during compile-time to adjust the install location or program efficiency.
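
If kytea --help is not found after installing to a non-default location, the install directory may simply be missing from PATH. The sketch below checks for that. It assumes the configure script accepts the usual autoconf --prefix option, and the prefix shown is only a hypothetical example.

```shell
# Hypothetical install prefix. With a standard autoconf configure script this
# would be chosen at build time with: ./configure --prefix="$PREFIX"
PREFIX="${HOME}/usr"

# Binaries under a custom prefix are only found if its bin directory is on PATH.
case ":${PATH}:" in
  *":${PREFIX}/bin:"*) on_path=yes ;;
  *) on_path=no ;;
esac
echo "${PREFIX}/bin on PATH: ${on_path}"

# Report whether the shell can locate the kytea executable at all.
if command -v kytea >/dev/null 2>&1; then
  echo "kytea found at $(command -v kytea)"
else
  echo "kytea not found on PATH"
fi
```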

Program Documentation

Using the Program

After you have installed KyTea, run the program to split the text into words and annotate each word with a POS and pronunciation. If test.raw is a file that contains raw text, the following command will create annotated text in the file test.full.

kytea < test.raw > test.full
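
The annotated output can be post-processed with standard text tools. As a minimal sketch, the commands below turn a file in the word/pos/pron layout into one tab-separated token per line. The file sample.full is a hand-written stand-in for real KyTea output (it reuses the placeholder tokens from this page), and the sketch assumes the default separators and that no word itself contains "/".

```shell
# Hypothetical stand-in for KyTea output, not produced by running kytea.
printf '%s\n' 'word1/pos1/pron1 word2/pos2/pron2' > sample.full

# Put one token per line, then split each token on "/" (the default tag separator).
tr ' ' '\n' < sample.full |
  awk -F/ 'NF { printf "%s\t%s\t%s\n", $1, $2, $3 }' > sample.tsv

cat sample.tsv
```

Each row of sample.tsv then holds the word, its POS tag, and its pronunciation in separate columns.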

Training a Model

While KyTea comes with a default model, if you have your own annotated text it is both simple and useful to build your own model (more detail on training KyTea can be found here). First, you must prepare a corpus with one sentence per line, in which each word is followed by one or more tags:

word1/pos1/pron1 word2/pos2/pron2 word3/pos3/pron3
word4/pos4/pron4 word5/pos5/pron5 word6/pos6/pron6

Let's say that this corpus is named train.full (full means that the file is fully annotated in the above format). If we have an unsegmented file named test.raw, we can create a model and analyze the unsegmented file using the following commands.

train-kytea -full train.full -model model.dat
kytea -model model.dat < test.raw > test.full

test.full will now contain the segmented text, with each word annotated with a POS tag and a pronunciation.
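
Before training, it can help to confirm that every token in the corpus has the expected number of fields. The following is a rough sketch of such a check for the word/pos/pron layout shown above. The file check.full and its contents are hypothetical, and a corpus with a different number of tags per word would need a different field count.

```shell
# Hypothetical two-sentence corpus; one token on the second line lacks a pronunciation.
printf '%s\n' \
  'word1/pos1/pron1 word2/pos2/pron2 word3/pos3/pron3' \
  'word4/pos4/pron4 word5/pos5 word6/pos6/pron6' > check.full

# Report every token that does not have exactly three "/"-separated fields.
awk '{
  for (i = 1; i <= NF; i++) {
    n = split($i, parts, "/")
    if (n != 3) printf "line %d: %s has %d field(s)\n", NR, $i, n
  }
}' check.full > check.log

# An empty log means every token matched the expected layout.
if [ -s check.log ]; then
  cat check.log
else
  echo "all tokens have word/pos/pron"
fi
```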

Usage

kytea

kytea performs word segmentation and tagging

Analysis Options: 
  -model   The model file to use when analyzing text
  -nows    Don't do word segmentation (raw input cannot be accepted)
  -notags  Don't do tagging (full input cannot be accepted)
  -notag   Skips a particular tag (-notag 1 will skip the first tag)
  -nounk   Don't estimate the pronunciation of unknown words
  -wsconst Do not segment some character types (e.g. "D" to not segment digits)
  -unkbeam The width of the beam to use in beam search for unknown words
           (default 50, 0 for full search)
Format Options: 
  -in      The formatting of the input  (raw/full/part/conf/tok, default raw)
  -out     The formatting of the output (full/part/conf/tok/eda/tags, default full)
  -tagmax  The maximum number of tags to print for one word (default 3, 0 implies no limit)
  -deftag  A tag for words that cannot be given any tag (for example, 
           unknown words that contain a character not in the subword dictionary)
  -unktag  A tag to append to indicate words not in the dictionary
Format Options (for advanced users): 
  -wordbound The separator for words in full annotation (" ")
  -tagbound  The separator for tags in full/partial annotation ("/")
  -elembound The separator for candidates in full/partial annotation ("&")
  -unkbound  Indicates unannotated boundaries in partial annotation (" ")
  -skipbound Indicates skipped boundaries in partial annotation ("?")
  -nobound   Indicates non-existence of boundaries in partial annotation ("-")
  -hasbound  Indicates existence of boundaries in partial annotation ("|")
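
To illustrate the boundary markers listed above, the sketch below rewrites a space-separated (tokenized) line so that every position is marked explicitly: "|" between words and "-" between characters inside a word. This is inferred from the option descriptions only, so the IO Formats page remains the authority on the exact partial annotation format. ASCII letters stand in for real characters here, because an awk that counts bytes rather than characters would split multi-byte UTF-8 text incorrectly.

```shell
# Hypothetical tokenized input: three "words" made of ASCII letters.
printf '%s\n' 'ab cd e' |
awk '{
  out = ""
  for (i = 1; i <= NF; i++) {
    if (i > 1) out = out "|"        # boundary between two words
    n = length($i)
    for (j = 1; j <= n; j++) {
      if (j > 1) out = out "-"      # no boundary inside a word
      out = out substr($i, j, 1)
    }
  }
  print out
}' > sample.part

cat sample.part
```

Leaving a space between two characters instead of "|" or "-" would, per the -unkbound description, mark that position as unannotated.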

train-kytea

train-kytea is a program to train models for KyTea.

Input/Output Options: 
  -encode  The text encoding to be used (utf8/euc/sjis; default: utf8)
  -full    A fully annotated training corpus (can be used multiple times)
  -tok     A tokenized training corpus (can be used multiple times)
  -part    A partially annotated training corpus (can be used multiple times)
  -conf    A confidence annotated training corpus (can be used multiple times)
  -feat    A feature file generated by -featout
  -dict    A dictionary file (one 'word/pron' entry per line, multiple possible)
  -subword A file of subword units. This will enable unknown word PE.
  -model   The file to write the trained model to
  -modtext Print a text model (instead of the default binary)
  -featout Write the features used in training the model to this file
Model Training Options (basic): 
  -nows    Don't train a word segmentation model
  -notags  Don't train a tagging model
  -global  Train the nth tag with a global model (good for POS, bad for PE)
  -debug   The debugging level during training (0=silent, 1=normal, 2=detailed)
Model Training Options (for advanced users): 
  -charw   The character window to use for WS (3)
  -charn   The character n-gram length to use for WS (3)
  -typew   The character type window to use for WS (3)
  -typen   The character type n-gram length to use for WS (3)
  -dictn   Dictionary words greater than -dictn will be grouped together (4)
  -unkn    Language model n-gram order for unknown words (3)
  -eps     The epsilon stopping criterion for classifier training
  -cost    The cost hyperparameter for classifier training
  -bias    Whether to use a bias value in classifier training (true)
  -solver  The solver (1=SVM, 7=logistic regression, etc.; default 1,
           see LIBLINEAR documentation for more details)
Format Options (for advanced users): 
  -wordbound The separator for words in full annotation (" ")
  -tagbound  The separator for tags in full/partial annotation ("/")
  -elembound The separator for candidates in full/partial annotation ("&")
  -unkbound  Indicates unannotated boundaries in partial annotation (" ")
  -skipbound Indicates skipped boundaries in partial annotation ("?")
  -nobound   Indicates non-existence of boundaries in partial annotation ("-")
  -hasbound  Indicates existence of boundaries in partial annotation ("|")

Development

Contributors

Graham Neubig (project leader, all coding)
Shinsuke Mori (oversight, power user)
Tetsuro Sasada (preparation of language resources)

If you are interested in participating in the KyTea project, please send an email to kytea@.

Revision History

Future Features/Known Issues

EUC input can only handle 2-byte EUC, but future versions will handle 3-byte characters as well.
Improve the efficiency of the dictionary implementation.

Version 0.4.7 (12/18/2013)

Update to the model to make it more robust to various domains.
Fixed some compilation issues on various platforms.

Older versions can be found on the version history page.