Return to KyTea
This is a toolkit for using active learning and partial annotation to do efficient domain adaptation.
One of the most attractive points of KyTea is that it is able to be adapted to new domains easily. In particular, as KyTea can be learned form partial annotations, it is possible to achieve good results with less annotation than previous methods. Traditional full annotation for word annotation takes the following form:
この 時期 の 中心 人物 は 、 風穴 延昭 で あ る 。
It can be seen that a space is placed between all words (where a word boundary exists), and there is no space where no word boundary exists. However, in order to do this, a whole sentence must be annotated at a time, while in this sentence there is only one word that is difficult to segment: "風穴." When using partial annotation, we are able to annotate only this difficult word and skip the rest of the words:
こ の 時 期 の 中 心 人 物 は 、|風-穴|延 昭 で あ る 。
Here, a pipe "|" indicates that a word boundary exists, a hyphen "-" indicates that word boundaries don't exist, while a space " " indicates a section that we have not yet annotated.
Before starting on domain adaptation, you must first install KyTea, then download and un-tar the following file.
Most Recent Version KyTea Domain Adaptation Toolkit v. 1.1
Please take the following steps to get ready to start annotating. Note that all files should be in the UTF-8 encoding.
KYTEA=$PATH_TO_KYTEA_BIN/kytea TRAIN=$PATH_TO_KYTEA_BIN/train-kyteaIf KyTea is installed in your default path (it can be run by just typing "kytea)" then there is no need to modify these lines.
GEN_CORPORA="-full data/BCCWJ.word"Included in the package is a Wikipedia corpus, which we are able to distrubte under the Creative Commons License. However, it is extremely small, and the results were automatically generated and corrected very simply by hand, it cannot be considered 100% correct. We recommend that you replace this with any high-quality language resources that you have at your disposal. In particular, the "Balanced Corpus of Contemporary Written Japanese (BCCWJ)" has extremely high quality annotations over a fair number of domains, so we recommend it for use with KyTea.
DICTS="-dict data/dict-1.txt -dict data/dict-2.txt"There are several things to note here. First, if it is possible to use a large, high-quality dictionary that matches the segmentation standard (for example, UniDic), this will greatly improve the segmentation accuracy compared with when one is not used. Second, unlike previous methods for morphological analysis, KyTea is able to use other dictionaries that don't necessarily match the segmentation standard of the corpus without decreasing accuracy. For example, if you have an in-domain dictionary that includes compound words, this can still be used to improve accuracy.
Once you have finished preparing, it is time to start annotation. If you follow the 3 steps below 5-50 times you will notice a large increase in accuracy on your target domain. Of course, more is better, but you will get diminishing returns.
[user@machine]$ ./makemodel.shThe file in work/XXX.mod (where XXX is a number) is the model trained using all the resources that you have finished annotating.This step takes approximately five minutes, but more or less depending on the size of the corpora that you are using.
[user@machine]$ ./saveannot.shIf you get an error such as 「Double boundary ' |' at XXX」 this means that you made a mistake annotating line XXX, so check to make sure that you inserted only one boundary character " ", "|" or "-" between each text character. saveannot.sh will need to be run one more time after this. Once this step is finished, return to step 1.
After you have finished annotating, you can find the fruit of your labors in the save/ folder. The file XXX.wann with the highest number is the file containing all sections annotated in this process. You can use this for later KyTea training using the "-part save/XXX.wann" option.
Return to KyTea
Last Modified: 2011-2-16 by neubig