pialign - Phrasal ITG Aligner

pialign is a package that allows you to create a phrase table and word alignments from an unaligned parallel corpus. Unlike other unsupervised word alignment tools, it creates the phrase table using a fully statistical model, with no heuristics. As a result, it can build phrase tables for phrase-based machine translation that achieve competitive results but are only a fraction of the size of those created with heuristic methods.

pialign was developed mainly by Graham Neubig during his period as an intern at NICT.

If you would like more details about the method or want to cite pialign in your research, please reference:

Download/Install

Download

Latest Version: pialign 0.2.4

Bleeding-Edge Code: @github
Past Versions: pialign 0.2.3 pialign 0.2.2 pialign 0.2.1 pialign 0.2.0 pialign 0.1.0

The code of pialign is distributed under the Common Public License v 1.0, and can be redistributed freely under the terms of this license.

Install

On Linux, Mac OS X, or Cygwin, download the source code, and install using the following commands.

tar -xzf pialign-X.X.X.tar.gz
cd pialign-X.X.X
./configure
make
make install
pialign --help

If this prints a help message, pialign is working properly.

Program Documentation

Using the Program

Suppose we have a source language file "data/f.txt" and a target language file "data/e.txt". To align this parallel text, we can run the program as follows:

$ mkdir out
$ pialign data/f.txt data/e.txt out/align. &> out/log.txt &

The program will run for a while (about 1-2 hours for 10k sentences, or 10-20 hours for 100k). Note that by default, only sentences of 40 words or fewer will be aligned. This can be changed by using "-maxsentlen XX", where XX is the maximum sentence length, but if this is set to a large value the program may use a large amount of memory and will run more slowly.
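Before starting a long run, it can be useful to check how much of the corpus falls within the sentence length limit. The following is a minimal sketch, not part of pialign: the function name count_alignable is hypothetical, and it assumes that words are whitespace-separated tokens and that a pair is kept only when both sides are within the limit. pialign's own counting may differ.

```python
def count_alignable(f_lines, e_lines, max_sent_len=40):
    """Count sentence pairs whose two sides both have at most max_sent_len tokens.

    Assumption: tokens are whitespace-separated, and the limit applies to
    both the source and the target side of each pair.
    """
    kept = 0
    for f_sent, e_sent in zip(f_lines, e_lines):
        if len(f_sent.split()) <= max_sent_len and len(e_sent.split()) <= max_sent_len:
            kept += 1
    return kept


# Tiny inline example with a limit of 3 words instead of the default 40.
f_demo = ["la maison", "une phrase qui est trop longue"]
e_demo = ["the house", "a sentence that is too long"]
print(count_alignable(f_demo, e_demo, max_sent_len=3), "of", len(f_demo), "pairs kept")
```

To apply this to real data, read the lines of "data/f.txt" and "data/e.txt" into two lists and pass them in, using the same value you plan to give to "-maxsentlen".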

When the program is done running, it will output two files, "out/align.1.samp" and "out/align.1.pt". The file align.1.samp contains the alignments (in the form of trees) created by the aligner, and align.1.pt contains a phrase table that can be used with Moses.

In order for Moses to run, you will also need a language model, which you can train using SRILM or IRSTLM (see the Moses step-by-step tutorial for more details on how to train a model). In addition, to achieve high accuracy, you should probably calculate a lexicalized reordering model by running the "itgstats.pl" script on the derivation file (align.1.samp). In the command below, "7" is the maximum length of a phrase, and 0.5 is the "pseudo-count" to be added for smoothing.

  $ script/itgstats.pl lex 7 0.5 < out/align.1.samp | LC_ALL=C sort > out/align.1.lex

Finally, you can combine the phrase table (TM), language model (LM), and reordering model (DM) together using the following command.

  $ script/make-moses-config.pl /full/path/to/out/align.1.pt /full/path/to/lm.txt --dm-file /full/path/to/out/align.1.lex > out/moses.ini

There are also a number of settings controlling the LM order, TM order, etc., described at the beginning of make-moses-config.pl, so please take a look to make sure there is nothing that you need to change. If all goes well, Moses can be run using this moses.ini file. If not, feel free to contact pialign-users for help at any time.

Options

The following options can be used with pialign.

~~~ Input/Output ~~~

Usage: pialign [OPTIONS] FFILE EFILE PREFIX

 FFILE is the foreign input corpus
 EFILE is the English input corpus
 PREFIX is the prefix that will be used for the output

Other input:
 -le2f         A file containing the lexicon probabilities for e2f
 -lf2e         A file containing the lexicon probabilities for f2e
               (These can be used with "-base m1" or "-base m1g" but are not necessary)

~~~ Model Parameters ~~~

 -model        Model type (hier/len/flat, default: hier)

 -avgphraselen A parameter indicating the expected length of a phrase.
               default is small (0.01) to prevent overly long alignments
 -base         The type of base measure to use (m1g is generally best).
               'm1g'=geometric mean of model 1, 'm1'=arithmetic mean of model 1,
               'uni'=simple unigrams, 'coocll'=log-linear interpolation of phrase
               cooccurrence probabilities in both directions (default 'm1g')
 -coocdisc     How much to discount the cooccurrence for phrasal base measures
 -defstren     Fixed strength of the PY process (default none)
 -defdisc      Fixed discount of the PY process (default none)
 -nullprob     The probability of a null alignment (default 0.01)
 -noremnull    Do not remember nulls in the phrase table
 -termprior    The prior probability of generating a terminal (0.33)
 -termstren    Strength of the type distribution (default 1)
 -domh         Do a Metropolis-Hastings rejection step

~~~ Phrase Table ~~~

 -maxphraselen The maximum length of a minimal phrase (default 7)
 -maxsentlen   The maximum length of sentences to use (default 40)
 -printmax     The maximum length of phrases included in the phrase table (default 7)
 -printmin     The minimal length of phrases included in the phrase table (default 1)
 -noword       Output only phrase alignments (do not force output of word alignments)

~~~ Inference Parameters ~~~

 -burnin       The number of burn-in iterations (default 9)
 -probwidth    The width of the probability beam to use (default 1e-4)
 -noqueue      Use exhaustive search instead of queue-based parsing
 -lookahead    The type of lookahead function to use:
               'none'=no look-ahead, 'ind'=independently calculate both sides
 -samps        The number of samples to take (default 1)
 -samprate     Take samples every samprate turns (default 1)
 -worditers    The number of iterations to perform with a word-based model (default 0)
 -noshuffle    Don't shuffle the order of the sentences
 -batchlen     The number of sentences to process in a single batch
 -threads      The number of threads to use (must be <= -batchlen)

FAQ

I want to create word alignments.

This can be done by running the script "script/itgstats.pl" on "align.1.samp". There are three types of word alignments that you can create: many-to-many (phrase) alignments, one-to-many (block) alignments, and one-to-one (word) alignments.

  many-to-many: $ script/itgstats.pl palign < out/align.1.samp > out/align.1.pal
  one-to-many:  $ script/itgstats.pl balign < out/align.1.samp > out/align.1.bal
  one-to-one:   $ script/itgstats.pl align < out/align.1.samp > out/align.1.wal

If you want to visualize these alignments you can do so as follows:

  $ script/visualize.pl data/e.txt data/f.txt out/align.1.pal > out/align.1.vis

What are the scores in the phrase table?

These are explained in detail in the referenced paper, but briefly:

   1,2: The conditional probabilities of the phrases p_t(e|f), p_t(f|e)
   3: The joint probability of the phrase pair p_t(e,f)
   4: The average posterior probability of a span containing e,f
   5,6: Lexical weighting probabilities using model 1 word probabilities
        (only output if the base measure uses model 1)
   7: The uniform phrase penalty
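As an illustration of how these scores might be read back in, here is a minimal sketch that labels the scores on one phrase table line. It is not part of pialign. It assumes the usual Moses layout of "source ||| target ||| scores", and it assumes that when only five scores are present, the two lexical weights (5,6) are the ones left out and the rest keep their order. The names parse_pt_line, lex_5, and lex_6 and the sample line are hypothetical.

```python
# Score labels follow the list above; "lex_5" and "lex_6" are placeholder names.
NAMES_WITH_LEX = ["p(e|f)", "p(f|e)", "p(e,f)", "span_posterior",
                  "lex_5", "lex_6", "phrase_penalty"]
NAMES_NO_LEX = ["p(e|f)", "p(f|e)", "p(e,f)", "span_posterior", "phrase_penalty"]


def parse_pt_line(line):
    """Split one phrase table line into (source, target, {label: score})."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(s) for s in fields[2].split()]
    if len(scores) == len(NAMES_WITH_LEX):
        names = NAMES_WITH_LEX
    elif len(scores) == len(NAMES_NO_LEX):
        names = NAMES_NO_LEX
    else:
        # Unknown layout: fall back to positional labels rather than guess.
        names = ["score_%d" % (i + 1) for i in range(len(scores))]
    return source, target, dict(zip(names, scores))


# Invented sample line, for illustration only.
sample = "la maison ||| the house ||| 0.5 0.4 0.01 0.3 0.2 0.1 2.718"
source, target, scores = parse_pt_line(sample)
print(source, "->", target, scores["p(e,f)"])
```

Check a few lines of your own align.1.pt against this layout before relying on the labels.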

What is the format of the derivation tree in the .samp file?

The bracketed representation in the derivation file has four different types of brackets:

   [X Y]: Indicates that children X and Y were generated by a regular ITG symbol
   <X Y>: Indicates that children X and Y were generated by a reverse ITG symbol
   ((( e ||| f ))): Indicates a phrase pair e, f
   { X }: Indicates that X was generated as a single phrase in the model. Words
          inside these brackets were actually aligned together as a single
          phrase, but by default pialign forces alignments down to words to
          preserve word alignments for possible other uses.
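To make the bracket notation concrete, the following minimal sketch pulls the phrase pairs at the leaves out of one derivation line with a regular expression. It is not part of pialign (the supported route is "script/itgstats.pl"). It assumes that leaves are written exactly as "((( e ||| f )))" and that the words themselves contain no ")))" sequence. The function name leaf_phrase_pairs and the sample derivation are hypothetical.

```python
import re

# Matches one leaf of the form ((( e ||| f ))), capturing each side.
LEAF = re.compile(r"\(\(\(\s*(.*?)\s*\|\|\|\s*(.*?)\s*\)\)\)")


def leaf_phrase_pairs(derivation):
    """Return the leaf phrase pairs as (e_words, f_words) lists, in order.

    An empty list on one side corresponds to a leaf with no words there.
    """
    return [(e.split(), f.split()) for e, f in LEAF.findall(derivation)]


# Invented derivation: a regular node whose second child is a reversed node.
sample = "[((( the ||| la ))) <((( red ||| rouge ))) ((( house ||| maison )))>]"
for e_words, f_words in leaf_phrase_pairs(sample):
    print(e_words, f_words)
```

This ignores the tree structure entirely. Recovering which spans were generated as single phrases would need a proper parser that tracks the "{ }" brackets.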

pialign is too slow!

The easiest way to speed up pialign is to use multi-threading. This can be done by setting the "-batchlen" and "-threads" parameters. "-threads" should be set to the number of cores you want to use. "-batchlen" must be at least as large as "-threads". The larger the batch length is, the faster multi-threaded processing will be, but very large values might cause a small decrease in accuracy. A value 10-40 times the number of threads is likely to be appropriate. For example, if you want to use 4 cores, you can set "-threads 4 -batchlen 40".
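The rule of thumb above can be written down as a small helper that derives "-batchlen" from the number of threads. This is only a sketch: the function name threading_options and the batch_factor parameter are hypothetical, and the default factor of 10 is simply the low end of the 10-40 range.

```python
def threading_options(threads, batch_factor=10):
    """Build the -threads/-batchlen options, with batchlen = threads * batch_factor."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if batch_factor < 1:
        # A factor below 1 would make -batchlen smaller than -threads.
        raise ValueError("batch_factor must be at least 1")
    return ["-threads", str(threads), "-batchlen", str(threads * batch_factor)]


# Assemble a full command line following "pialign [OPTIONS] FFILE EFILE PREFIX".
command = ["pialign"] + threading_options(4) + ["data/f.txt", "data/e.txt", "out/align."]
print(" ".join(command))
```

With 4 threads and the default factor this reproduces the "-threads 4 -batchlen 40" setting from the example.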

Also, if you cannot use multiple cores and you want a speed-up at the cost of a possible small drop in alignment accuracy, you can narrow the probability beam using the "-probwidth" option. The default is "-probwidth 1e-4", so if you set this to "-probwidth 1e-3" instead, you will likely see a significant speed-up.

Finally, using the "-viterbi" option can speed up alignment significantly (about 2.5 times?). I have not done a complete test of how this affects accuracy, but it will probably drop a little.

There is no "configure" file to build the program.

If you checked the source directly out from the repository, you may have to run autotools before building. First, make sure autotools is installed on your computer, then run "autoreconf -i" which will prepare the configure file for you.

Development/Support

Contact

If you have a question about pialign, please submit it to the pialign-users mailing list. (If you don't get a response, you can also try contacting Graham at neubig at gmail dot com.)

Contributors

Revision History

Future Features/Known Issues

Version 0.2.4 (11/7/2012)

Version 0.2.3 (9/14/2012)

Version 0.2.2 (9/8/2012)

Version 0.2.1 (9/2/2012)

Version 0.2.0 (7/6/2012)

Version 0.1.0 (5/13/2011)