KyTea Version History
Future Features/Known Issues
- EUC input can only handle 2-byte EUC, but future versions will handle 3-byte characters as well.
- The efficiency of the dictionary implementation will be improved.
Version 0.4.7 (12/18/2013)
- Updated the model to make it more robust across domains.
- Fixed some compilation issues on various platforms.
Version 0.4.6 (5/28/2013)
- Fixed a bug that caused major problems when training on half-width characters
- Added "tags" as an output format, for when you don't need the original word
Version 0.4.5 (4/8/2013)
- Fixed some compile errors, especially under clang
Version 0.4.4 (2/25/2013)
- Major refactoring to make the code easier to read and improve compile time a little
- Added the ability to select -out eda for compatibility with the Eda parser
Version 0.4.3 (2/4/2013)
- Added the -wsconst option, making it possible to not segment some character types (e.g. "-wsconst D" prevents segmentation of digits)
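The effect of a constraint such as "-wsconst D" can be illustrated with a small Python sketch. This is not KyTea's implementation: it applies the constraint as a post-processing step on a list of boundary decisions, and the helper names (char_type, constrain_boundaries) and the catch-all type "O" are hypothetical. Only the type code "D" for digits comes from the entry above.

```python
def char_type(ch):
    """Return a coarse character type code.

    'D' (digit) follows the changelog example; 'O' (other) is a
    hypothetical catch-all used only for this illustration.
    """
    return "D" if ch.isdigit() else "O"


def constrain_boundaries(text, boundaries, const_types):
    """Remove boundaries between adjacent characters of a constrained type.

    boundaries[i] is True when a word boundary is proposed between
    text[i] and text[i + 1]. A new list is returned; the input is not
    modified.
    """
    if len(boundaries) != max(len(text) - 1, 0):
        raise ValueError("expected one boundary decision per character gap")
    constrained = list(boundaries)
    for i in range(len(text) - 1):
        left = char_type(text[i])
        if left in const_types and left == char_type(text[i + 1]):
            constrained[i] = False  # never split inside a run of this type
    return constrained


# Every gap is proposed as a boundary; the digit run "12" is kept whole.
result = constrain_boundaries("ab12c", [True, True, True, True], {"D"})
```

Boundaries between a digit and a non-digit are left untouched, so only runs of the constrained type are protected from segmentation.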
Version 0.4.2 (5/30/2012)
- Normalization of half-width characters
- Model updated for better accuracy
- Added a new input option to take tokenized text with no tags, which is the default when using the -nows option
- Other minor bug fixes
Version 0.4.1 (3/31/2012)
- A number of small bug fixes
- Added a Windows version (Thanks to knzm for contributing!)
Version 0.4.0 (1/27/2012)
- Improvement of analysis speed (approximately 2x, 5x with -notags)
- Addition of unit tests and bug fixes
- Update of the model (for improved robustness)
Version 0.3.2 (8/09/2011)
- Made it possible to train using feature files as described on the training page.
- Fixed a bug with the "-nounk" option.
Version 0.3.1 (6/17/2011)
- Fixed a few minor bugs.
- Upgraded the API.
Version 0.3.0 (4/26/2011)
- Made it possible to estimate multiple tags at one time, and combined the default POS and pronunciation models into a single model.
- Added support for the global tagging model described in the ACL 2011 paper.
- Fixed a bug that caused a confidence of 100 for some probabilistic models.
Version 0.2.1 (1/21/2011)
- Fixed a few bugs that caused model training to fail on certain data sets.
- Fixed the -deftag option, which was not working properly.
- Better handling of escape characters in corpora.
Version 0.2.0 (1/6/2011)
- Added an API for programmatic access to KyTea
- Upgraded LIBLINEAR to version 1.7 to allow for improved logistic regression (-solver 7), and updated the models on the models page
- Allowed for the specification of the separator between words and tags using options (details here)
- Added support for a default tag (-deftag) that is used when no tag candidates are generated; it is set to "/UNK" by default.
- Made it possible to tune the cost parameter (-cost) for the SVM or LR training.
- Added a -debug option that allows the printing of more details for analyzing KyTea's results.
- Changes in the text model format to make it easier to read (feature weights are now written directly below their names).
- Quantization is now disabled by default, which will reduce speed but increase stability when training models. It can be re-enabled by using the --enable-quantize option of the configure script.
Version 0.1.3 (10/01/2010)
- Cleaned up the code, fixing a number of warnings.
- Fixed a bug that prevented compiling on Mac OS X.
- Added models for POS tagging to the Model page.
Version 0.1.2 (8/18/2010)
- Changed unknown word pronunciation from full search to beam search. The -unkbeam option was added. This should fix crashes due to long unknown words.
- Probabilities were not properly output when using the "-out conf" option with a model trained with "-solver 6" (including the provided logistic regression models). This was fixed.
- Fixed a bug that didn't allow models to be trained with "-nope".
- Confidence weighted output now has a single blank line after each sentence to ease processing.
- Fixed a bug that caused dictionaries to not be read properly when not containing pronunciations.
- Fixed a bug that broke models trained using the -modtext option.
Version 0.1.1 (5/11/2010)
- Character-based pronunciation modeling for unknown words has been added.
- The dictionary and training data of the default model have been expanded.
Version 0.1.0 (3/5/2010)
- When using multiple dictionaries, they are now treated as separate features (as opposed to a generic dictionary feature in the previous versions).
- Update to the model file format (incompatible with version 0.0.3)
- Partially annotated files may now use '?' in addition to ' ' for unlabeled boundaries (to express human annotator uncertainty).
- Double spaces are handled as a single space in full annotation.
- Escaped characters are allowed in annotated input.
- The output of Logistic regression now reflects probabilities.
- A model is now included with the package, and the model can be specified using an environment variable.
- Added the ability to output multiple answers in order of preference.
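As a rough illustration of the partial annotation change above, the following Python sketch reads a line in which characters alternate with boundary symbols. The use of '|' for a boundary and '-' for no boundary is an assumption about the corpus format; only ' ' and '?' as unlabeled markers come from the entry above. The function name and label constants are hypothetical, and escaped characters are deliberately not handled.

```python
BOUNDARY, NO_BOUNDARY, UNLABELED = 1, -1, 0

# Assumed symbol table: '|' and '-' are assumptions about the format;
# ' ' and '?' are the unlabeled markers mentioned in the changelog.
SYMBOLS = {"|": BOUNDARY, "-": NO_BOUNDARY, " ": UNLABELED, "?": UNLABELED}


def parse_partial_annotation(line):
    """Split a partially annotated line into characters and gap labels.

    Assumes a strict character/symbol alternation, so escape sequences
    are not supported in this sketch.
    """
    if line and len(line) % 2 == 0:
        raise ValueError("expected characters alternating with boundary symbols")
    chars = list(line[0::2])
    labels = []
    for symbol in line[1::2]:
        if symbol not in SYMBOLS:
            raise ValueError(f"unknown boundary symbol: {symbol!r}")
        labels.append(SYMBOLS[symbol])
    return chars, labels


chars, labels = parse_partial_annotation("a|b-c d?e")
```

Here both ' ' and '?' map to the same unlabeled value, reflecting that the two symbols are interchangeable for training purposes in this sketch.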
Version 0.0.3 (11/30/2009)
- Speed improvements (approximately 2 times faster than 0.0.1)
- Fixed a bug that caused a crash when -nows was enabled
Version 0.0.2 (11/16/2009)
- Support for Shift-JIS (in addition to the previous EUC-JP and UTF-8).
- Build system change to Autotools
Version 0.0.1 (11/05/2009)
- Initial release of KyWs and KyPe.