Writing Unit Tests for Sampling

One popular method for estimating statistical models is sampling, and in particular Gibbs sampling. As I’ve written before, sampling is based on randomness, which makes it somewhat difficult to debug.

On the other hand, one popular way to ensure the quality of code is through arrays of unit tests, which are designed to test your code at a very fine-grained level (e.g. one test for each function in the code base). However, because sampling is by nature random, the output of a sampling algorithm can be expected to change every time you run it, and thus it is not trivial to write unit tests. This article shows two simple ways that I’ve used to overcome this difficulty and write unit tests for sampling algorithms.

First, let’s assume that we have a sampling function called SampleMultinomial, which samples a single value from a multinomial distribution, defined by an array of doubles. In the program below, we sample from the distribution {0.5, 0.3, 0.2}, so we should expect the program to print “0″ 50% of the time, “1″ 30% of the time, and “2″ 20% of the time.

#include <vector>
#include <iostream>
#include <cstdlib>

using namespace std;

int SampleMultinomial(const vector<double> & distribution) {
    double value = (double)rand()/RAND_MAX;
    for(int i = 0; i < distribution.size(); i++)
        if( (value -= distribution[i]) <= 0 )
            return i;
    cerr << "Overflow in SampleMultinomial()" << endl;
    return -1;
}

int main() {
    srand( time(NULL) );
    vector<double> my_distribution(3);
    my_distribution[0] = 0.5;
    my_distribution[1] = 0.3;
    my_distribution[2] = 0.2;
    cout << SampleMultinomial(my_distribution) << endl;
    return 0;
}

Now, how do we test this function? We can do so by utilizing the fact that sampling enough times will help us recover the true probability distribution (the law of large numbers), or we can utilize the fact that the “random” numbers that the computer is generating are actually pseudo-random.

Utilizing the Law of Large Numbers

The original motivation for sampling is that if we take enough samples, we can approximate the true distribution. In our previous example, this means that if we take the count of the number of times that SampleMultinomial returns “0″ (c(x=”0″), and divide it by the total number of times, that we called SampleMultinomial (C), this will approach the true probability of “0″ (p(x=”0″)), as C approaches infinity:

p(x=0) = \lim_{C \to \infty} c(x=0)/C

Now, let’s try to write this in code. We do this by writing a loop that calls SampleMultinomial “C” times, and counts the number of times we get each result. Finally, for each of “x=0″ “x=1″ and “x=2″, we check that its probability is sufficiently close to the true expected probability.

bool TestLawOfLargeNumbers(const vector<double> & distribution) {
    // Call SampleMultinomial C times, and count the number of times we get
    // each result
    int C = 1000000;
    vector<int> counts(distribution.size(), 0);
    for(int i = 0; i < C; i++) {
        int x = SampleMultinomial(distribution);
        counts[x]++;
    }
    // Check to make sure that the value of the probabilities match the true
    // probabilities of the distribution within a confidence threshold
    double confidence = 0.005;
    bool passed = true;
    for(int i = 0; i < distribution.size(); i++) {
        double estimated_probability = (double)counts[i]/C;
        if(abs(distribution[i] - estimated_probability) > confidence) {
            cerr << "FAILED: Difference between at " << i << " ("
                 << abs(distribution[i] - estimated_probability)
                 << ") is more than confidence " << confidence << endl;
            passed = false;
        }
    }
    return passed;
}

It should be noted that this works for discrete distributions where we can explicitly count the number of times each instance is generated. If we are handling continuous distributions, it may be possible to handle them in this way by measuring the Kullback-Leibler divergence from the true distribution, or quantizing the distribution, but I have never tried these.

Utilizing Pseudo-Randomness

Using the law of large numbers relatively simple to implement, and can be used without any knowledge of the internals of the algorithm. On the other hand, when generating samples is expensive, it may be relatively slow to acquire enough samples to ensure that we will get close enough to the true distribution. Also, as mentioned above, continuous distributions are not easy to handle. An alternative method that is useful in these cases is to utilize pseudo-randomness to ensure that we will get the same result every time, then use the expected result to test our algorithm.

To do so, we first set the random seed of our pseudo-random number generator to a specific value, say 1234, and generate as many numbers as we need in sequence.

int main() {
    srand( 1234 );
    const int C = 10;
    for(int i = 0; i < C; i++)
        cout << (double)rand()/RAND_MAX << endl;
    return 0;
}

This gives us an output of:

0.223118
0.216796
0.447559
0.492617
0.569365
0.473787
0.474919
0.0336593
0.954041
0.958502

Given these random numbers, and a grasp of how the sampling algorithm works, we can estimate the results of the sampling algorithm based on these numbers.* For example given an array {0.5, 0.3, 0.2}, SampleMultinomial should convert all random numbers [0.0,0.5] into “0,” numbers (0.5,0.8] into “1,” and numbers (0.8,1.0] into “2.” Using this fact, we manually create an array of answers that corresponds to our random seeds, and create a test function that checks to make sure that these are actually the values that our sampling algorithm produces.

bool TestPseudoRandom(const vector<double> & distribution) {
    srand( 1234 );
    const int C = 10;
    int expected[C] = { 0, 0, 0, 0, 1, 0, 0, 0, 2, 2 };
    bool passed = true;
    for(int i = 0; i < C; i++) {
        int x = SampleMultinomial(distribution);
        if(x != expected[i]) {
            cerr << "FAILED: Actual value at " << i << " (" << x
                 << ") not equal to expected (" << expected[i] << ")" << endl;
            passed = false;
        }
    }
    return passed;
}

If you’d like to try this yourself, I’ve made the full code available here.

*One thing we need to be careful here is that this is also relying on the fact that the random number generator will generate the same numbers given a particular seed, which may not be true for some languages or operating systems.

Posted in Uncategorized | 2 Comments

Most cited papers of the ACL (1990-2009)

I was wondering what papers had the most influence on the field of computational linguistics and natural language processing, so I tried making a list of the most cited papers from the annual meeting of the ACL (Association for Computational Linguistics) for the past twenty years. All citation counts are from Google scholar as of today, so take them with a grain of salt, but here goes:

Posted in Uncategorized | 2 Comments

How do children learn language (and how can we mimic them?)

Recently I’ve been reading a bit about Psycholinguistics, the two books below for example:



Introduction to Psycholinguistics was quite good, with a lot of useful and eye-opening information. The computational modeling book is also rather good (I’m only halfway through), but maybe a bit thin for the money at only 70-some pages.

How do children learn language?

The thing that struck me the most when reading both books was that it seems that a certain pattern pervades children’s language learning, be it the learning of phonemes, word/concept mappings, or morphology. This is the fact that children first under-generalize, then over-generalize, then get it about right.

What does this mean? Let’s take the example of a child that wants to learn what the word “rose” means. In the under-generalization stage, the child may be able to appropriately identify red roses, but not identify white roses when they are told “look at the rose.” In the over-generalization stage, they will identify both of the roses, and also tulips as a “rose.” Finally, as they grow older their idea of roses will come into line with adults, and they will be able to properly identify both colors of roses, but realize that the tulip is something different.

  Early Middle Adult
Red Rose Yes Yes Yes
White Rose No Yes Yes
Tulip No Yes No

This pattern is most pronounced in morphology. At first kids will be able to identify only a small amount of both regular and irregular verbs, probably remembering each word exactly as they hear it. After that, they will learn how to conjugate, but mistakenly conjugate irregular verbs using regular conjugations (like “goed”). And finally they will (mostly) converge on the actual usage of verbs in the language that they are learning.

For phoneme learning, I’m not sure if the over-generalization phase occurs, but apparently young babies are able to recognize the difference between phonemes, even if they are not distinct in their native language. For example, Japanese babies can recognize the difference between “l” and “r” easily, while Japanese adults cannot.

What can we do to mimic them?

So the question is, what does this mean when we are building computer programs to try to do unsupervised language acquisition, particularly in a statistical framework? Well, it couldn’t hurt to try to mimic this learning style. After all, it works for humans, right?

For learning phonemes, this could indicate that we want to start with an acoustic model with a large number of HMM states or Gaussian mixtures. This model would over-fit the data heavily, but as we get more evidence, some of the Gaussian clusters will start to run together, so we set a threshold based on something like KL divergence, and merge clusters together that go under the threshold. Or it’s possible that we could formulate something like this with non-parametric Bayesian statistics.

For learning morphology, it seems clear that we need some sort of device that first checks whether a word is irregular, then falls back to generating it from regular patterns if it is not. In the early learning phase, the regular patterns have not been learned yet, so unless we’ve seen the word already we’re out of luck. In the later learning phase, the back-off probability goes too high, so we generate from standard patterns too often. Finally, the back-off and generation of irregulars from memory becomes balanced correctly. This also seems like something that non-parametric Bayes would be good at, particularly with a hierarchical model. The irregular verbs could be generated from a Dirichlet process, with the regular patterns as its base measure.

For the concept mapping, this is a little bit more difficult, but it seems a bit like the phoneme learning problem. We could map, say images, into a continuous feature space and model this with a Gaussian mixture model, and it would be reduced to the same problem.

This, of course, is not my specialty, so I’d love to hear more comments, especially from people who have done work on something like this in the past.

Posted in Uncategorized | Leave a comment

Practical Bayes: Variational vs. Sampling

Recently Bayesian methods (particularly non-parametric Bayesian methods) have been widely used in unsupervised learning for NLP, as they provide a principled way to balance model complexity and expressiveness.

There are two major techniques for learning Bayesian models: Variational Bayes (VB), and Gibbs Sampling. For more details on the methods themselves you can reference the linked tutorials. The point of this post is to compare the two methods from a practical standpoint, and help you decide which may be better suited for use in application development or modeling research.

Speed

Winner: Variational Bayes

In terms of speed of convergence to an answer, variational Bayes is the winner hands-down. It will generally converge to a reasonable answer in several 10s of iterations, while sampling can take hundreds or thousands if things don’t go well. This is because in each iteration, Variational Bayes considers fractional counts representing multiple possibilities over the entire corpus, while sampling chooses only a single possible value locally for each variable. However, there have been a number of advances in sampling speed in recent years, such as block sampling algorithms (for example, for word segmentation or word alignment) and type-based sampling, so the difference is smaller now than it was a few years ago.

Parallelism

Winner: Variational Bayes (Maybe)

The most time-consuming step in Variational Bayes is the estimation of fractional counts for hidden variables in the corpus (for example, in POS estimation, estimating the POS configurations for every sentence). This can be trivially parallelized by splitting the corpus into segments, estimating the counts for each segment on parallel processors, then re-merging the counts. For sampling, it is not (yet?) possible to parallelize in a principled manner, although there have been reports that even it is theoretically incorrect, parallel sampling does not cause significant drops in accuracy.

Stability

Winner: Variational Bayes (Maybe)

Gibbs sampling is a randomized algorithm, and by nature randomized algorithms will give us different results every time. However, even for variational Bayes, which is a deterministic algorithm, there is a necessity to initialize the hidden variables and this is often done by either inserting a small amount of random noise, or using a randomized algorithm such as k-means. As a result, both are not 100% stable, but my instinct is that Variational Bayes will return more consistent results.

Ease of Implementation

Winner: Toss-Up

The actual algorithms in Gibbs sampling are often quite simple to implement, essentially boiling down to “remove variable y_i, sample new value of variable y_i, add variable y_i.” In contrast, for VB, it is often necessary to consider interactions between variables in, say, the same sentence, which often requires the use of the forward-backward algorithm or other dynamic-programming style techniques (these can also be used with sampling to improve efficiency but are not necessary). Thus, the pure amount of code needed to implement sampling is often less than that for VB.

However, when implementing VB, we are guaranteed to see increases in the lower bound at every step, and thus if the numbers start decreasing we know there is a bug in our program, making it relatively easy to debug. Sampling, on the other hand, is more difficult to debug, as it is theoretically possible for the likelihood to decrease at any iteration (although if the likelihood continues to decrease for several iterations it is likely that you have a bug).

Accuracy

Winner: Gibbs Sampling (Maybe)

VB, like EM, is an approximation technique, and will converge to a non-optimal solution if it is not initialized properly. On the other hand, sampling is guaranteed to converge to the true distribution if it is given enough time, and often achieves greater accuracy when given enough time to run in practical situations (especially when using more advanced sampling techniques such as type-based sampling).

Memory for Sparse Problems

Winner: Gibbs Sampling

When using methods such as non-parametric Bayes that essentially allow for an infinite number of potential elements in a probability distribution, there is a question of which elements to represent in memory. Taking the example of unsupervised word segmentation, every substring in the corpus is a potential word, and thus should be assigned a probability. In sampling, the answer about which elements to represent in memory is simple: just save every word that occurs in at least one sample, and do not save any words that do not occur in any sample. When combined with a prior that favors sparse distributions, the number of actual saved words will be quite manageable, even for large corpora.

However, in variational methods, even words with fractional counts must be saved in memory, and in the extreme all substrings that occur in the corpus will have counts saved in memory. There are methods for truncating small counts with a cutoff that help manage this problem, but these are further approximations over the already approximate VB model, and are not very effective at early stages of the training process when the distribution has not had enough time to become sparse.

Flexibility

Winner: Gibbs Sampling

Perhaps the largest disadvantage of VB is that it generally relies on conjugate prior distributions, which may or may not actually be the best prior for the model that you are trying to use. For example, while the Dirichlet distribution is widely used as a conjugate prior for the multinomial distribution, the Pitman-Yor distribution has proven a better fit for natural language tasks, and work that I’ve talked about previously in covariance modeling has also used non-conjugate priors that are easy to handle with sampling but difficult (or impossible?) to handle using VB.

Conclusion

Hopefully this will help you decide which method to use for your next implementation of a Bayesian model. If you notice anything that I missed or any errors please leave them in the comments section!

Posted in Bayesian, Machine Learning, Sampling, Variational Bayes | 3 Comments

Modeling Covariance in Topic Models

Warning: this post is relatively high level, and assumes that you have some (but not too much!) knowledge of topic models, specifically Latent Dirichlet Allocation (LDA). However, while it is high level, I am excited by the idea and think it will be quite useful in actual ML-based natural language applications. So here we go.

Topic Models and Independence

The traditional LDA-based topic model is a generative model for a collection of documents. We assume that the collection of documents was generated in the following fashion.

  • For each of K topics, we generate a probability distribution over words given the topic. For example, if topic 4 corresponds to “sports,” the probability of the word “baseball” will be relatively high given topic 4. This probability distribution is given a Dirichlet prior.
  • For each of M documents, we generate a probability distribution over topics given that document. For example, if document 3 is about bribery in baseball, the probabilities of topics representing “sports” and “crime” may be high given document 3. These distributions are also given a Dirichlet prior.
  • For each of N words in a particular document, we first choose the topic that the word belongs to from the document’s topic distribution, then choose the word according to the topic’s word distribution.

If we appropriately estimate parameters for each of these probabilities, we can find a number of pieces of different information such as grouping of words that correspond to topics, which topics are prevalent in certain documents, etc.

However, it is also obvious that we are making quite a few unrealistic independence assumptions here. The one that is most obvious is the bag-of-words assumption, that each word is in a document is generated independently of the surrounding words (given the topics), and ways to resolve this assumption have been the subject of a fair variety of research. But, there are also some less obvious independence assumptions that nonetheless have the potential to significantly degrade the accuracy of the model.

In particular, it should be noted that by assuming that the topic distribution for each document is generated independently from a Dirichlet distribution, we are disregarding the fact that some topics tend to co-occur more than others. For example, “technology” and “stock price,” or “politics” and “crime.” No why is this a problem? If we assume that the topic distribution for each document is generated independently, this co-occurrence between topics will be accounted for within each topics word distribution, and we will start getting words like “iPod” showing up in the “stock price” topic, and “fraud” occurring in the “politics” topic (we will skip the debate about the correctness of the second decision). Recently, I’ve read about two interesting methods about how to fix this problem.

Pachinko Allocation

The first, a method called Pachinko Allocation, expands the idea of Latent Dirichlet Allocation to arbitrary non-cyclic graphical models. Particularly, in the so-called 4-layer Pachinko allocation model, they split the two topic layers into “coarse topics” and “fine topics.”

Fine topics are similar to normal topics in traditional LDA model, while coarse topics are essentially groupings of topics that often co-occur. For example, we may have a coarse “sports” topic that gives a high probability to the finer topics of “Basketball,” “Soccer,” or others.

Logistic Normal Distribution

While Pachinko allocation is something of a hot topic recently, an alternative, possibly simpler approach to modeling this covariance is described here in the context of unsupervised dependency parsing. Here, instead of adding multiple layers and a larger number of multinomial distributions to the model, we simply change the prior on the documents’ topic distributions from a Dirichlet distribution to a different distribution called the Logistic Normal Distribution (LND).

The basic idea behind the LND is that the prior on probabilities follows a normal distribution. Let’s say we have topics “stock,” “tech,” “politics,” and “crime” from the previous example. If we assume that the probabilities of all the topics are independent (like we do when we use the Dirichlet distribution), the covariance matrix of the normal distribution would be diagonal, resulting in a distribution over the probabilities of “stock” and “tech” to be independent. However, if we allow the probabilities of topics to co-vary, we are able to express the fact that “stock” and “tech” tend to be either both high or both low:

Now as an observant reader, you may have noticed “wait, we’re trying to model probabilities, but there’s no guarantee that the normal distribution will add up to one!” That’s where the “logistic” part of the name comes in. We take the values of each dimension generated by the normal distribution and use the softmax function, which guarantees that they will add to one.

Why is this Important?

So you may be wondering why I would go out of my way to write about this. Well the answer lies in the paper about using the logistic normal distribution in unsupervised dependency parsing. Let’s say we have similar part of speech tags such as present-tense and past-tense verbs. These two tags will tend to have similar, but slightly different children, and the paper shows that by capturing covariance, parsing results can be improved. This same kind of approach could be expanded to any number of areas, unsupervised POS tagging being the most obvious. It will be interesting to see what kind of applications people come up with for these sorts of approaches.

Posted in Bayesian, Machine Learning | 4 Comments

What Goes into a Machine Translation Baseline?

Machine translation (MT) is a complicated task, if it wasn’t we would have solved it already. However researchers have been chipping away at this difficult task one small bit at a time, and MT has finally gotten good enough that it can be understood to some extent for many language pairs.

The way much (academic) machine translation research is done is:

  1. Think of some clever improvement to the way existing things are done.
  2. Train a machine translation system using both the baseline method and your new proposed method.
  3. Do an experiment showing that your proposed improvement is the best thing since sliced bread (in other words, that it outperforms the baseline).

The most common baseline in used by most research is Moses, a decoder for phrase-based statistical machine translation (paper). Moses comes with not only a decoder, but also a standard training regimen that allows you to build your own translation system (with only a moderate amount of pain) from a parallel corpus.

However, unless you really look inside the training process it’s hard to appreciate just how much research went into creating this system that is now the baseline that MT systems are trying to beat. This post is an attempt to help people that are not familiar with Moses’s full training process to understand just how much technology went in to the system that MT researchers are now attempting to beat. Note that this is the training process with the default settings, and doesn’t use any of Moses’s more advanced features like factored translation models, trees, etc.

  1. Find a parallel corpus and do preprocessing such as tokenization and lowercasing.
  2. Automatically create bilingual classes (paper) using the mkcls program. These are used to help smooth word alignment probabilities.
  3. Run GIZA++, an implementation of the classic IBM alignment models (paper), with a number of improvements such as class-based smoothing (paper). This is done to create alignments in both directions from source to target and target to source.
  4. Use heuristics to combine bi-directional alignments (paper) into a single alignment.
  5. For each pair of words e and f, calculate lexical translation probabilities P(e|f) and P(f|e).
  6. Extract phrases (paper, p. 54) that are consistent with the alignments acquired in step 4. The size of phrases are limited by length, a trade-off between table size and accuracy (paper).
  7. Score each phrase with a total of five feature functions.
    1. The conditional probability P(f|e), as specified by the noisy-channel model framework.
    2. The reverse conditional probability for each phrase P(e|f), which helps accuracy (paper).
    3. Lexical translation probabilities in both directions (paper), which help determine which phrases are reliable and which are not.
    4. A constant phrase penalty, which can be used to help favor longer or shorter phrases.

    These features (and the following lexical reordering probabilities) are quite important, but easily overlooked.

  8. Calculate lexical reordering probabilities (paper) that are used to indicate which phrases tend to accompany reordering.
  9. Create a smoothed n-gram language model (paper) using a toolkit such as SRILM.
  10. Perform minimum error rate training (paper) by performing iterations that translate some held-out data, then adjust the model parameters to minimize an evaluation measure such as BLEU score (paper).
  11. With the learned models and parameters, run decoding on the test data.

For those of you keeping track, that’s a total of 13 different papers. And that’s only to make the models needed, without even talking about the difficult search problems that go into decoding! But that’s just a testament to what a monumental problem machine translation is, and how all of the small steps that people all over the world are making every year are bringing us closer to our final goal of breaking down the language barrier once and for all.

Posted in Machine Translation | 3 Comments

Sampling: Random Order? Corpus Order?

Gibbs sampling is a common technique that is used in Bayesian learning that is used to find the true distribution of some distribution over probabilistic variables that we cannot calculate directly. I won’t cover the details here, but Wikipedia or Pattern Recognition and Machine Learning give good introductions.

But the important thing is, Gibbs sampling can be used to sample multi-variable distributions one variable at a time. For example, let’s say we have an HMM based unsupervised part-of-speech tagger, and we want to sample variables y1 through y5 that correspond to the configuration of parts of speech of the following sentence:


We can do this by first sampling the value of the variables for “time,” then for “flies,” “like,” “an,” and “arrow” in order. However, we can actually sample these values in any order, for example, first sample “an,” then “like,” “time,” “arrow,” “flies.” In fact, regardless of the order that we sample in, as long as well sample all the variables in each pass through the corpus, we are guaranteed that given infinite samples we will eventually be able to simulate the true distribution.

In real life however, we aren’t able to take an infinite number of samples, so it is important that we are able to reach the real distribution sooner rather than later. After I uploaded my my tutorial on Bayesian non-parametrics, Daichi Mochihashi was kind enough to point out that I should mention that it is better to take samples in random order, as opposed to taking them in corpus order. I wanted to see for myself exactly how important this is, so I decided to do some experiments to see whether randomizing the sampling order results in faster convergence. The following two experiments show the likelihood and accuracy for unsupervised POS tagging and word segmentation in Japanese (the models described in my tutorial). All graphs are the average of eight independent runs of the sampler.

It can be seen that, perhaps as expected, random sampling does better than ordered sampling as we increase the number of samples, although more so for word segmentation than POS estimation. So apparently it’s a good idea to randomize the order of our sampling. One other interesting observation is that while in  word segmentation the distribution saturates relatively quickly, simple Gibbs sampling for the HMM continues to increase approximately linearly for 1000 iterations.

If anyone has any other experiences in this area please feel free to comment!

Posted in Bayesian, Machine Learning, Sampling | 4 Comments

SVM, Logistic Regression, and Precision

On twitter a little while ago, @mathieuen asked an interesting question about whether the fact that he was seeing support vector machines (SVM) get better precision than logistic regression (LR) was a result of the learning algorithm (specifically the hinge loss function). I answered off the top of my head that I thought that that would be true in the case where there were more negative examples than positive examples (which is generally true in many situations), but it turns out that I was actually incorrect and the reverse is actually true, in simple cases at least.

To give some background, both SVMs and LR are methods that can be used for binary classification. If we have two classes such as the dots in the below graph, we want to find an appropriate line that divides the two.

The problem is that, as long as the data are separable by a line, there are often any number of lines that can be used to separate the data. For example, any of the lines in the below figure would correctly separate the two types of dots into classes.

SVMs and LR both give a criteria to decide which line is the best line to use. SVMs are max margin learning methods, which means that they attempt to find the line that is right in the middle of the two points that is closest to the other side. LR on the other hand, assumes that each value is given a probability of true or false according to the sigmoid function, and then attempts to maximize the probability of the actual data. Actually, if we have a relatively balanced number of “true” and “false” data points, SVMs and LR will often return similar solutions (or exactly the same solution in the extreme case of two data points below).

However, when the classes are more unbalanced the story is a bit different. While SVM continues to pick a point directly in between the two closest points, LR’s decision boundary will be pushed away from the larger class, which ends up in an increase of the likelihood according to the sigmoid function.

So what does that mean for precision and recall? Well, if the decision plane is closer to the true points, that means that fewer (and only more certain) points will be classified as true. This results in LR’s precision being higher than that of SVMs. Of course this is a very simple case, not considering multiple dimensions, regularization, or tuning hyperparameters, but the general idea that maximum margin training or hinge loss will directly result in increased precision when there are more falses than trues is, in fact, completely the opposite.

By the way, I’ve uploaded the script I used to make the graphs, feel free to use it if you’d like.

Posted in Discriminative Models, Machine Learning | 2 Comments

The Importance of Error Analysis

There was a discussion on Twitter yesterday started by a tweet by @mamoruk (in Japanese). Basically, the point is that error analysis in NLP is too broad-focused, and many people don’t even look at the data, just looking to see if their accuracy, BLEU, or any other sort of evaluation measure goes up. As someone who has been guilty of “only looking at the numbers” in the not so recent past, I feel like I need to preach the good word of proper, systematic error analysis, share a few ideas, and maybe get some from others as well.

Of course, the traditional method of error analysis in NLP is generally to output a file containing the differences between the reference and system output, eyeball it, and get an idea of what’s going wrong. For example, in Japanese morphological analysis we could get a file with lines like this:

駅伝/N V/S 2/N の/P 立/V 役者/N
駅伝/N V/S 2/N の/P 立役者/V
e e e e s d

We can see that a word segmentation error has caused two words in the reference (top line) to be combined into one in the output. We can search for parts that are not equal and notice a variety of errors.

However, this doesn’t give us a good idea of the general trends of what kind of errors are occurring, particularly which errors are most common, or whether we are having a lot of common errors, or if there are a wide variety of uncommon errors. So instead, let’s try counting which errors occur most often (using this script):

で/AUX で/P 40
に/AUX に/P 20
で/P で/AUX 18
に/P に/AUX 4
kg/SUF kg/N 4
突き止め/V 突き止め/N 3
さらに/CONJ さらに/ADV 3
が/V が/P 3
とも/SUF と/P も/P 3
あ/V る/END ある/ADN 3

This gives us a much more clear idea of what is going wrong. We can see the largest cause of errors is that the model is having trouble distinguishing between auxiliary verbs and particles for “で” and “に”, and that there are a number of other common segmentation errors, etc. It is also useful to look at the less common errors to get a grasp on the prevalence of unknown words, etc.

Of course, there are people doing this with much more sophisticated techniques. Two recent papers that I particularly like in this area are Goldwater et al.Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates” (SPECOM 2009), and Steinbiss et al. “Direct Observation of Pruning Errors (DOPE): A Search Analysis Tool” (InterSpeech 2010). Goldwater’s work uses a logistic regression model to model the factors that contribute to speech recognition errors, while Steinbiss’s work allows you to visualize exactly where pruning errors are made during speech recognition decoding.

Posted in Error Analysis | Leave a comment