by Graham Neubig (2/15/2012)
This is a set of sentences from Wikipedia that covers the most common n-gram patterns in a separate corpus of Japanese text. Sentences are chosen greedily: at each step, the most common n-gram (of length 1 to 4) that no previously chosen sentence has covered is identified, and a sentence containing it is added to the set. When multiple sentences contain that n-gram, ties are broken based on how many other uncovered n-grams each candidate sentence covers.
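The selection procedure described above can be sketched roughly as follows. This is a minimal illustration, not the script that produced the data. The function names (`ngrams`, `select_sentences`) and the `limit` parameter are hypothetical. It assumes whitespace-tokenized input, that ties go to the sentence covering more uncovered n-grams, and that n-grams appearing in no candidate sentence are simply skipped.

```python
from collections import Counter


def ngrams(tokens, max_n=4):
    """Yield every n-gram (as a tuple) of length 1..max_n in a token list."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_sentences(candidates, corpus, max_n=4, limit=None):
    """Greedily pick candidate sentences that cover frequent corpus n-grams.

    candidates: whitespace-tokenized sentences to choose from (e.g. Wikipedia).
    corpus:     whitespace-tokenized sentences of the separate reference corpus.
    limit:      optional cap on the number of sentences returned (assumption).
    """
    # Count n-gram frequencies in the reference corpus.
    freq = Counter()
    for line in corpus:
        freq.update(ngrams(line.split(), max_n))

    # For each candidate, keep only the n-grams that occur in the corpus.
    cand_grams = [
        {g for g in ngrams(s.split(), max_n) if g in freq} for s in candidates
    ]

    # Inverted index: n-gram -> indices of the candidates that contain it.
    index = {}
    for idx, grams in enumerate(cand_grams):
        for g in grams:
            index.setdefault(g, []).append(idx)

    covered = set()
    chosen = []
    # Visit corpus n-grams from most to least frequent.
    for gram, _count in freq.most_common():
        if gram in covered or gram not in index:
            continue  # already covered, or no candidate can cover it
        # Tie-break: prefer the candidate covering the most uncovered n-grams.
        best = max(index[gram], key=lambda i: len(cand_grams[i] - covered))
        chosen.append(candidates[best])
        covered |= cand_grams[best]
        if limit is not None and len(chosen) >= limit:
            break
    return chosen
```

For example, `select_sentences(["a e", "a b c", "x y"], ["a b c", "a b d", "a e"])` picks `"a b c"` first, because it contains the most frequent token `a` and covers more uncovered n-grams than `"a e"`, and then `"a e"` to cover the remaining n-grams.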
It consists of three files:
それ だけ の こと で は あ り ま せ ん か 。
し な く て も い い 、 と い う もの で は な い と 思 い ま す 。
地下 鉄 システム の 整備 に よ っ て これ ら の 問題 が 解決 する こと が 期待 さ れ て い る 。
Chose の, covered 45 unique, 640870 (5.49748522669331%) valid n-grams
Chose 、, covered 116 unique, 1298170 (11.1359096177641%) valid n-grams
Chose に, covered 192 unique, 2112392 (18.1204359901152%) valid n-grams
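The log lines above can be read programmatically with a short parser such as the sketch below. The names `LOG_LINE`, `parse_log`, and `implied_total` are hypothetical. The field meanings are an assumption based on the sample only: the first number is taken to be the running count of unique n-grams covered, the second the running count of n-gram occurrences covered, and the percentage that count's share of all valid n-gram occurrences.

```python
import re

# Pattern matching one log entry; field interpretations are assumptions.
LOG_LINE = re.compile(
    r"Chose (?P<ngram>.+?), covered (?P<unique>\d+) unique, "
    r"(?P<tokens>\d+) \((?P<percent>[\d.]+)%\) valid n-grams"
)


def parse_log(text):
    """Return one dict per log entry found in the text."""
    return [
        {
            "ngram": m["ngram"],
            "unique": int(m["unique"]),
            "tokens": int(m["tokens"]),
            "percent": float(m["percent"]),
        }
        for m in LOG_LINE.finditer(text)
    ]


def implied_total(entry):
    """Estimate the total n-gram count implied by one entry's percentage."""
    return entry["tokens"] / (entry["percent"] / 100.0)
```

Because the parser uses `finditer`, it works whether the entries are on separate lines or run together on one line.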
This data was derived from Wikipedia and may be freely distributed under the Creative Commons Attribution-ShareAlike License.