Japanese Balanced Sentences

by Graham Neubig (2/15/2012)

This is a set of sentences from Wikipedia that cover the most common n-gram patterns in a separate corpus of Japanese text. Sentences are chosen one at a time: each new sentence is selected to cover the most common n-gram (of order 1 to 4) that has not yet been covered by a previously chosen sentence. When multiple sentences contain that n-gram, the tie is broken based on how many other uncovered n-grams each candidate sentence covers.
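The selection procedure described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the data: the function names (`ngrams`, `select_balanced`), the `limit` parameter, and the input format (sentences already split into token lists) are all assumptions made for the example. It also assumes that ties go to the sentence covering the larger number of uncovered n-grams, and that n-grams appearing in no candidate sentence are simply skipped.

```python
from collections import Counter


def ngrams(tokens, max_n=4):
    """Return the set of all n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_balanced(candidates, reference, max_n=4, limit=None):
    """Greedily pick candidate sentences that cover frequent reference n-grams.

    candidates: list of token lists to choose from (hypothetical input format)
    reference:  list of token lists from the separate corpus
    limit:      optional cap on the number of sentences selected (assumption)
    """
    # Count every n-gram occurrence in the separate reference corpus.
    freq = Counter()
    for sent in reference:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                freq[tuple(sent[i:i + n])] += 1

    # Precompute each candidate's n-gram set and an n-gram -> sentence index.
    cand_grams = [ngrams(sent, max_n) for sent in candidates]
    index = {}
    for sid, grams in enumerate(cand_grams):
        for gram in grams:
            index.setdefault(gram, []).append(sid)

    covered = set()
    chosen = []
    # Visit reference n-grams from most to least frequent.
    for gram, _count in freq.most_common():
        if limit is not None and len(chosen) >= limit:
            break
        if gram in covered or gram not in index:
            continue  # already covered, or no candidate contains it

        # Tie-break (assumed direction): prefer the sentence that covers
        # the most reference n-grams that are still uncovered.
        def gain(sid):
            return sum(1 for g in cand_grams[sid] if g in freq and g not in covered)

        best = max(index[gram], key=gain)
        chosen.append(best)
        covered |= cand_grams[best]

    return [candidates[sid] for sid in chosen]
```

Since Japanese text has no spaces between words, the token lists would have to come from a word segmenter or from a simple character split such as `list(sentence)`; which unit the original data uses is not stated here.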


It consists of three files:


This data was derived from Wikipedia, and may be freely distributed according to the Creative Commons Attribution-ShareAlike License.