29 January 2006

Predictive Models

There are at least two reasonable goals for NLP research: (1) produce predictive systems that mimic humans; (2) produce explanatory models. (2) gets more into CL and CogSci and I may discuss it at a later date. This post is about (1), at least the machine learning approach to (1).

Our goal is to mimic a human on some task. This means we will, at the end of the day, want to measure our performance against human performance. This measurement is our loss function, and is, in my mind, the only thing that really matters. Coming up with a good loss function is really really hard and I will certainly discuss this issue in another post. But suppose we can come up with one: l(t,h) = how much it hurts to produce a hypothesis h when the true answer was t.

The subsequent question is: given some collected data, how can I minimize l. The dominant approach seems to be: approximate l by some well-known loss (binary loss, squared error, absolute loss, etc.) and then use one of the many good classifiers/regressors to solve our original problem.

This step irks me greatly. Not because this approximation is always bad, but because it is never (rarely) stated. Many papers say something like "Our task in XYZ is to minimize misclassification error of problems created by ABC." This is rarely actually true. I would be much happier with the following explanation: "Our task in XYZ is to minimize TUV loss, but we don't know how to do that, so we'll instead minimize misclassification error." This issue manifests itself commonly in another way: putting the cart before the horse. We often gather a data set, label it, solve the labeling problem (eg., by multiclass classification) and then go back and define the problem in such a way that this is an appropriate solution. This is harder to detect, but equally disingenuous.

So let's say we've done things right: we've defined our loss function and we've gotten some labeled data. Now we have to minimize our loss on this data. If we can do this directly, great. If not, we can say so and that we'll approximate it by some other loss l'.

Now we can start looking at features. We have an l' that we can minimize and we want to come up with features that are useful for this minimization. In particular, if ours is a structured problem, and l' is invariant to some change in structure, we needn't reflect this aspect of the structure in our feature functions function (the prototypical example here being Markov features in sequence labeling for Hamming [per-label] loss, though this issue itself could take a whole other post). We should set aside part of the training data for manual inspection to do feature engineering.

Experimentation is the last real step. We've split off a training set, development set and test set (*). We train a model on the training set, evaluate it on the development set. We look at the dev predictions and see what went wrong. We reengineer some features to fix these problems and repeat. Only at the end do we run on the test set and report our results in a paper.

(*) I'm not a fan of cross-validation in such a cyclic process. The features will be engineered based on the training and development data sets, and to then fold all this back into the bag and do cross-validation will give us unrealistically good results.

And as a very final step, in the case that our optimized l' is not the same as l, I would like to see some results that show that they at least correlate reasonably well. And easy way to do this is to run several versions of the model and to report both l and l' scores and let us see that improving l' does indeed improve l.

25 January 2006

Domain Adaptation

We NLPers face this problem all the time: we have training data from one domain/genre but really want to work in another. Eg., the treebank is WSJ text but we care about email or web pages or whatever. We'd like to be able to intelligently use our annotated WSJ text to get a good statistical model for a different domain.

I've been working on this problem for a while now and have a partial solution. I'm most interested in the case that we have lots of annotated "out of domain" (OOD) data and a little annotated "in domain" (ID) data. Domain Adaptation for Statistical Classifiers is a paper that's been accepted to JAIR that presents one way to model this problem. The key idea is to model the OOD/ID data distributions as mixtures. There are three mixture components: a "truly ID" distribution, a "truly OOD" distribution and a "general" distribution. We say the OOD data comes from a mixture of "truly OOD" and "general," while the ID data comes from a mixture of "truly ID" and "general." The learning task is to tear apart our data sets to figure out what "general" (and hence relevant to the ID task) information there is in the OOD data.

The framework, when applied to maximum entropy models, gives relatively simple update equations for model parameters. The derivations take a bit of thought, but are not insane. The approach is a partial solution because it's limited to maximum entropy models. I think the problem should be amenable to a more learning theoretic analysis, but haven't had time to make much headway here.

I'm a bit surprised this problem hasn't gotten more attention in the NLP community (a similar problem -- speaker adaptation -- exists in the speech community). Or perhaps I've missed it in NLP. It seems like we, as NLPers, should really care about this issue.

23 January 2006

Most Influential NLP Papers

I conducted a mini survey recently, asking people I knew what they thought were the most influential papers in NLP from the past two decades. Here are the wholly unscientific results, sorted from most votes and subsorted by author. Note that I only got responses from 7 people. I've not listed papers that got only one vote and have not included my personal votes.

(7 votes): Brown et al., 1993; The Mathematics of Statistical Machine Translation
(5 votes): Collins, 1997; Three Generative, Lexicalised Models for Statistical Parsing
(4 votes): Marcus, 1993 Building a large annotated corpus of English: the Penn Treebank
(3 votes): Berger et al., 1996; A maximum entropy approach to natural language processing
(2 votes): Bikel et al., 1997; An Algorithm that Learns What's in a Name
(2 votes): Collins, 2002; Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
(2 votes): Lafferty et al., 2001; Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
(2 votes): Och, 2003; Minimum Error Rate Training for Statistical Machine Translation
(2 votes): Papineni et al., 2001; Bleu: a method for automatic evaluation of machine translation
(2 votes): Ratnaparkhi, 1999; Learning to Parse Natural Language with Maximum Entropy Models
(2 votes): Yarowsky, 1995; Unsupervised Word Sense Disambiguation Rivaling Supervised Methods

This seems to be dominated by parsing and techology/machine learning papers, with a smattering of MT thrown in for good measure. I feel that the former represent a sort of "lowest common demoninator." There were several singleton votes for various papers in summarization, language modeling, IR, and other fields, but, with the exception of the MT papers, getting >1 vote meant that you had to be useful to more than just your one community. It's also a bit surprising that Collins' parser is so high, but no other parsers got any votes (Charniak, Lin, etc.).

Feel free to say what you think is missing!

19 January 2006

Structured Prediction 2: My Perspective

In Structured Prediction 1: What's out there?, I discussed current SP technology: Markov fields, CRFs, M3Ns and SVMISOs. Here I'll discuss my work briefly. I'll come back to SP once more to discuss what seems to be on the horizon.

With so many options, why bother with a new one? There are problems that are not solved by the standard approaches. The biggest problem is the "argmax assumption," which states that you can solve the following problem:

arg max_{y in Y} f(x,y; w) = arg max_{y in Y} w*Phi(x,y)

Where the equality is under the standard linear predictor assumption. This maximization requires that, for an input x, we are able to compute the highest scoring possible output (efficiently).

The problem is that for many (most?) NLP problems, without making completely unreasonable assumptions, this maximization is intractable. The standard assumptions that make this operation polynomial time are the Markov assumption for sequences and the context-free assumption for trees. But we know that both of these assumptions are invalid, and a few papers have tried to deal with this by using sampling, reranking or approximate inference techniques.

On the other hand, if you look at what people in NLP actually do to solve problems, they use approximate search (MT is a good example). In this way, the search space can be tuned to precisely capture the functional (feature) dependencies needed.

My perspective is that if we're going to use search anyway, we should take this into account while learning. There is no point to learning a model that ranks the best output highest if we cannot actually find this output. And once we start using approximate search, any performance guarantees related to learning pretty much go out the window.

The really cool thing is that a completely different brand of machine learning people have considered learning in the context of search, but they call it reinforcement learning. The key difference between SP and RL, as I see it, is that in RL you can affect the world whereas in SP you cannot. In the standard "action/state/observation" terminology, in SP you only get one observation at the very beginning (the input x), while in RL the observations keep coming in as you take more actions. This seems to be the fundamental difference between the two problems.

So, by treating SP in a search framework, what we basically want is a function that takes our input x, a partial output represented by a state in our search space s, and predicts the best action to take: the best thing to add onto the partial output. For instance, in sequence labeling with left-to-right decoding, we might predict the label of the next word. In MT, we might predict a French phrase the translate and its English translation and append this to a translation prefix. The action set is potentially large, but not impossibly so (if it were, the problem would be intractable anyway).

The great thing is that this enables us to formally cast structured prediction as a cost-sensitive multiclass classification problem. Lots of algorithms have been developed for the multiclass problem, and we can seek to apply these to the SP problem through this reduction. This gives us simple, efficient algorithms with good practical performance and strong theoretical guarantees. The key requirement that makes all this work (essentially my version of the arg max assumption) is that for a given input x, true output y and state s in the search space, we can compute the action a that is optimal to execute from s. This is the "optimal policy assumption." (We can actually weaken this assumption to require only a bound on the optimal policy, which is necessary for optimizing things like BLEU or ROUGE, at least under interesting models.) It is provably a strictly weaker assumption than the assumptions made by CRFs and M3Ns, essentially because it doesn't care about the feature space, which means we can have whatever the heck features we want (a good thing for NLP).

17 January 2006

The MT Blues: Would a program get its facts right?

I just came across an article on Slate entitled The Translator's Blues: Will I get replaced by a computer program?, written by Jesse Browner, a human translator. I've no idea how wide-read Slate is, and I think Browner is correct that his job is not (yet) in jeopardy, there are several common misconceptions in the article that really shouldn't be there.

First, the results of his example sentence are atrocious. I see no reason why even the rule-based systems should get this wrong (though I suspect his poor results had to do largely with the difficulty of inputting accented characters). Language Weaver stood out, which is good, because it's a good system. In fact, in all the news texts he tried, it is reported to fare well.

And he's right that the technology has limits: it could not translate the first sentence of Don Quixote well. This is the domain adaptation problem: Language Weaver's system was trained on news data and (surprise surprise) it performs well on news and not on fiction. I think the important point is that the problem is domain. IMO, it has nothing to do with context, which is what Browner attributes the problem to. The con/pen example is something that I believe a decent statistical MT system should and would get right; if not now, in a few years once language models are better. It's unfortunate that Browner doesn't say that Language Weaver is (largely) a statistical systems, because he seems to say that they suck, yet LW was by far the best system he tried.

My final comment is in reference to the attribution to Mike Collins, who clearly knows much more about this field than many. Most likely Mike was quoted out of context, but Language Weaver's system -- and, in fact, most good research systems out there -- are based on machine learning. That's what "statistical" means.

(Incidentally, for those less literary minded, the first sentence of Don Quixote should be translated something like "In a village of La Mancha, the name of which I have no desire to call to mind, there lived not long since one of those gentlemen that keep a lance in the lance-rack, an old buckler, a lean hack, and a greyhound for coursing." I'm not convinced that a professional translator who worked for the UN would find this easy to translate either.)

15 January 2006

NLP as Glorified Memorization

The view of NLP as essentially a hunt-and-return technology has been gathering momentum since the burgeoning of the web. Example-based MT takes this view to machine translation, and phrase-based statisitcal MT is essentially EBMT done with statistics. In question answering (factoid style), the situation is even more dramatic. Deepak's thesis was essentially devoted to the idea that the answer to any question can be found in huge corpora by relatively simple pattern-matching. To a somewhat lesser degree, information extraction technology is something like smoothed (or backed-off) memorization, and performance is largely driving by one's ability to obtain gazeteers relevant to one's task.

Pushing such memorization technology further will doubtless lead to continued success, and there are many open research questions here. I would love others to answer these questions, but I have little interest in answering them myself. Fortunately, I think there are many interesting real-world problems for which simple memorization techniques will not work, and deeper "analysis" or "understanding" is required.

Any QA/summarization task that focuses on something other than "general world knowledge" fits into this category. I might want to ask questions to my email client about past emails I've recieved. The answer will likely exist only once, and likely not in the form I ask the question. I might want to ask questions about scientific research, either from PubMed or REXA. I might want to ask about the issues involved in the election of the Canadian PM (something I know nothing about) or the confirmation hearings of Samuel Alito (something I know comparatively more about). And I would want the answers tailored to me. If I owned a large corporation or were running a campaign, I would want to know what my supporters and detractors were saying about me, and who was listening to whom.

I could be proven wrong: maybe memorization techniques can solve some/all of these problems, but I doubt it. What other problems are people interested in that may not be solvable with memorization?

12 January 2006

Long Distance Collaboration

For those who have engaged in such an affair, I'm curious how you make it work well. I've been working with John Langford recently, drawing out and exploiting connections between structured prediction and reinforcement learning. I'm visiting family in Chicago now and will be going to TTI tomorrow to visit. Up until this visit, we've talked mostly over email, less over phone and even less in person (some at ICML and NIPS 2005). But the rate of information transfer in such collaboration is significantly less than if John had an office down the hall from me. How do people make this work?

  1. Site visits. This is/was the defacto standard in business for ages, before the internet came. People also tended to collaborate less and in business there is perhaps less of a push for collaboration rather than subcontracting. While undoubtedly the best method for interacting, there is significant overhead in terms of cost, travel time, etc., especially in different continents.
  2. Telephone. Another good business standard; unfortunately, I'm really a white-board kind of guy, and it's really hard to draw on someone else's white board. Talking about math and referencing papers is also difficult over the phone. Large time differences can also be a nuisance (I recall talking to Yee Whye at about midnight my time because of the time difference to Singapore).
  3. E-mail. This seems to be what is used most: it is cheap and doesn't suffer from the time difference, though replies are delayed. It enables you to think more about a topic before replying, but cuts down on often-constructive back-and-forth discussion (since that would take too long). The medium is also limited: typing math into email is ugly for anything complex, and you still can't draw on someone else's white board. Twice in my life I've attached .eps figures, but this is highly ineffective.
So, what do other people do? Is there any hope? It seems that email could theoretically be retrofitted to fix a lot of the problems I have with it (multiple media, drawing, math, etc.), but does such technology exist currently?

09 January 2006

The Mad Paper Rush

The HLT/NAACL deadline passed last month and the ACL deadline is next month. The cluster here seems more swamped than usual, people are rushing to get experiments done and papers written. It seems that most research gets done two months prior to the due dates and papers are rarely, if ever, submitted early. (Actually, I saw a statistic from ICML that most early-submitted papers actually got rejected in the end!) As someone who is trying to organize a ski trip to Mammoth two weeks before the ACL deadline and having difficulty getting people to come, I wonder why there's always this last minute rush. The ACL deadline was published at least four months ago, and even if not, it's actually later this year than usual.

The only explanation that makes sense to me is that the rush is to run a handful of well-controlled experiments based on continuous tweaking of a system. The further you can delay running these experiments, the more you can tweak. Importantly, assuming you're a good researcher, the experiments are not to convince yourself that your approach is viable. The experiments are to convince your potential reviewers that your approach is viable.

Based on this, there seem to be two ways to cut down on this rush (assuming people don't like it). First is to reduce the amount of tweaking done. Second is to reduce the number of well-controlled experiments run. Unfortunately, both of these will probably result in lower chances of your paper being accepted (confirming the above-cited ICML statistic). But this is bad. Tweaking will improve your scores (hopefully), but will rarely be mentioned in the paper, leading to irreproducible results. Running too many experiments can cloud the point of your paper and significantly cut down on your time to work on real things.

Despite this, people continue to tweak and run too many experiments (myself included). This seems to be because the cost of failure is too high: if your paper is rejected, you basically have to wait another year to resubmit, so you want to cover all bases. Two solutions come to mind. First, we could spread our conferences further apart, meaning that you have only to wait 6 months. Second, we could try to ensure that reviewers know that with a limit of 8 pages, one cannot run all possible experiments, and that they can suggest alternative contrastive experiments to be run, but if the experiments back up the main point(s) of the paper, this should be sufficient. I don't think the "review/response" approach will fix this because running a new experiment is not what author responses are for, and this is just delaying the problem, rather than fixing it (though I do advocate the response phase).

06 January 2006

Structured Prediction 1: What's Out There

Structured prediction (SP) -- the machine learning task of generating outputs with complex internal structure -- is near and dear to my heart. It is also a generalization of many NLP problems (summarization, MT, IE, coref, parsing, tagging, etc.). This is the first part of a multi-part post on structured prediction (first: what's out there; second: what I do; third: where we should go).

The most important aspect of the SP problem (aside from the data) is the loss function: l(y,y') gives us how much it costs us to produce y' when the correct answer is y. Our goal, as always, is to minimize the expected loss over the distribution that generated our training data. The common way of doing this is to posit a large set Y of possible outputs and to try to find a function f(x,y) on an input x and output y that ranks outputs: f(x,y) > f(x,z) if y is a better output than z. Once we've learned such a function, the only thing left is to find a way to search over all y to find the one with highest score. Aside from the parameterization of f (which is almost always taken to be linear the feature set), the only thing left that matters is how we measure one f against another. For notational covenience, let g(x,y) = exp(-f(x,y)). Here are some common ways to score f:

  • Generative: f is good if g(x,y) / sum_{x',y'} g(x',y') is high for all (x,y) in the training data
  • Conditional: f is good if g(x,y) / sum_{y'} g(x,y') is high for all (x,y) in the training data
  • Discriminative-M: f is good if f(x,y) > f(x,y') + l(y,y') for all (x,y) in the training data and all y', where errors are punished equally
  • Discriminative-S: f is good if f(x,y) > f(x,y') for all (x,y) in the training data and all y', where errors are punished proportional to l(y,y').
Assuming all optimizations are equally difficult (or easy), these techniques should be ranked according to how closely they match our true desire: low expected loss. To me, the Discriminative-S method is the closest. To see the distinction between D-S and D-M, consider the case where we can easily find a function that ranks the correct outputs higher than the incorrect outputs, but cannot distinguish between the incorrect outputs. From the perspective of expected loss, this is sufficient. From the perspective of D-S, this is also sufficient. But it's not good enough for D-M. D-M requires that if some really bad y' exists, then f(x,y) has to be much larger (by l(y,y')) than f(x,y'). I don't see a reason why we should care about this.

Another way of looking at these methods is how we can optimize them. Let's suppose f is just w*h(x,y), where h(x) = sum_r h(x,r) is a decomposition over regions (adjacent tags in sequences, for instance), w is a weight vector and h provides a feature vector. We can compute the gradient of the optimization criterea (using exponentiated gradients for the discriminative methods) for each of the methods on a given input (x,y) as:

sum_{r in y} h(x,r) - sum_{y'} sum_{r' in y'} X h(x,r')

This is just the difference between the true features h(x,r) and the guessed features h(x,r'), where the guesses are weighed by X. The only difference between the above techniques is how we define X.
  • Conditional: X = (1/Z) exp [ sum_{r' in y'} w*h(x,r') ]
  • Discriminative-M: X = (1/Z) exp [ sum_{r' in y'} e C [ (t-1) l(y,r') + w*h(x,r') ] ]
  • Discriminative-S: X = (1/Z) exp [ sum_{r' in y'} e C l(y,r')^(t-1) w*h(x,r') ]
In all cases, Z is defined so that the sum is unity. In the latter two, e is a learning rate, C is the SVM slack hyperparameter and t is the iteration number. The conditional X can be interpreted as the posterior probability of this (incorrect) region under the current model. Going from conditional to D-M involves the insertion of the loss function, leading to a loss-augmented posterior. The (t-1) term essentially means that as more iterations are performed, the loss term becomes increasingly important. In the D-S version, we multiply by the loss rather than add it.

What does this tell us? Well, it tells us exactly what we saw before: the gradient on the D-M model is high when the loss is high, even if the (non-loss-augmented) posterior is zero. This means that even if our model would never produce a wrong region, but that region has high loss, the gradient is still high at that point. This contrasts with the D-S gradient, which, once the posterior disappears, doesn't care about the loss anymore.

The only tricky thing in any of these models is to compute the sum over y' efficiently (or, alternatively, approximate it by computing the max over y'). In well formed cases, like sequence labeling or parsing (under strict assumptions), this is possible using standard dynamic programming (or sometimes, max-flow) methods.

Incidentally, the "conditional" model refers to CRFs, the "discriminative-s" model refers to SVMISOs and the "discriminative-m" model refers to M3Ns.