Travatar comes with a program tree-converter with the following options. More information about some of the functionality is included below.
-input_format The format of the input (penn/json/egret) -output_format The format of the output (penn/json/egret/word) -split A regular expression to split words in the tree (e.g. "-") -compoundsplit The language model file for use in compound splitting -compoundsplit_filler Optional fillers for compound splitting, e.g. "e:es" for German -compoundsplit_threshold Words with unigram probability mass above this threshold will not be split -compoundsplit_minchar Mininimum required characters in subword for compound splitting -binarize How to binarize the trees (none/left/right/cky) -case How to case the trees (none/low/title/true:model=FILE) -flatten Whether to flatten unary productions -debug How much debug output to produce
Using the -binarize option allows the tree to be binarized according to a number of options.
Using the -case option allows the words in the tree to be all lowercased (low), the first word to be title cased (title), or the first word to be true cased according to a model (true).
The -split option allows us to specify an arbitrary regular expression. Given this regular expression, all words matching this expression will be split in half around the expression. This is useful for splitting hyphenated or slashed expressions, which are not split by the Penn treebank, or other parsing standards.
WordSplitterCompound works similarly to Moses compound-splitter.perl, i.e. it compares the unigram probability of the word (e.g. "autobahn") with the mean unigram probability of its subwords (e.g. "auto" and "bahn") and picks the one that is higher. It also considers fillers between words and deletes them if necessary (e.g. "arbeitstier" splits into "arbeit"+ (filler=s) + "tier").
bin/tree-converter -compoundsplit LMFile -compoundsplit_filler "es:s:e" parsed.de > parsed.split.de
The language model (LMfile) provides the unigram statistics and should be trained on text that matches parsed.de tokenization beforehand. Since we use KenLM, bigram or above is assumed, even though the algorithm only looks at unigrams. The fillers are specified in a colon (:) delimited format.
Additional options for this class are compoundsplit_threshold and compoundsplit_minchar, which determine which words are candidates splitting. Usually we don't want to consider a word for splitting if its unigram probability is above some high threshold, or if its subwords are too short. The default values are probably fine. Using "-debug 1" option will generate statistics on the number of words split.