28 June 2010

ICML 2010 Retrospective

Just got back from Israel for ICML, which was a great experience: I'd wanted to go there for a while and this was a perfect opportunity. I'm very glad I spent some time afterwards out of Haifa, though.

Overall, I saw a lot of really good stuff. The usual caveats apply (I didn't see everything it's a biased sample, blah blah blah). Here are some things that stood out:

Structured Output Learning with Indirect Supervision (M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth). This was probably one of my favorite papers of the conference, even though I had learned some about the work when I visited UIUC a few months ago. Let's say you're trying to do word alignment, and you have a few labeled examples of alignments. But then you also have a bunch of parallel data. What can you do? You can turn the parallel data into a classification problem: are these two sentences translations of each other. You can pair random sentences to get negative examples. A very clever observation is basically that the weight vector for this binary classifier should point in the same direction as the weight vector for the (latent variable) structured problem! (Basically the binary classifier should say "yes" only when there exists an alignment that renders these good translations.) Tom Dietterich asked a question during Q/A: these binary classification problems seem very hard: is that bad? Ming-Wei reassured him that it wasn't. In thinking about it after the fact, I wonder if it is actually really importantant that they're hard: namely, if they were easy, then you could potentially answer the question without bothering to make up a reasonable alignment. I suspect this might be the case.

A Language-based Approach to Measuring Scholarly Impact (S. Gerrish, D. Blei). The idea here is that without using citation structure, you can model influence in large document collections. The basic idea is that when someone has a new idea, they often introduce new terminology to a field that wasn't there before. The important bit is that they don't change all of science, or even all of ACL: they only change what gets talked about in their particular sub-area (aka topic :P). It was asked during Q/A what would happen if you did use citations, and my guess based on my own small forays in this area is that the two sources would really reinforce eachother. That is, you might regularly cite the original EM even if your paper has almost nothing to do with it. (The example from the talk was then Penn Treebank paper: one that has a bajillion citations, but hasn't lexically affected how people talk about research.)

Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola). This received one of the best paper awards. While I definitely liked this paper, actually what I liked more what that it taught me something from COLT last year that I hadn't known (thanks to Percy Liang for giving me more details on this). That paper was A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang) and basically shows that you can use spectral decomposition techniques to "solve" the HMM problem. You create the matrix of observation pairs (A_ij = how many times did I see observation j follow observation i) and then do some processing and then a spectral decomposition and, voila, you get parameters to an HMM! In the case that the data was actually generated by an HMM, you get good performance and good guarantees. Unfortunately, if the data was not generated by an HMM, then the theory doesn't work and the practice does worse than EM. That's a big downer, since nothing is ever generated by the model we use, but it's a cool direction. At any rate, the current paper basically asks what happens if your observations are drawn from an RKHS, and then does an analysis. (Meta-comment: as was pointed out in the Q/A session, and then later to me privately, this has fairly strong connections to some stuff that's been done in Gaussian Process land recently.)

Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process (N. Bartlett, D. Pfau, F. Wood). This paper shows that if you're building a hierarchical Pitman-Yor language model (think Kneser-Ney smoothing if that makes you feel more comfortable) in an online manner, then you should feel free to throw out entire restaurants as you go through the process. (A restaurant is just the set of counts for a given context.) You do this to maintain a maximum number of restaurants at any given time (it's a fixed memory algorithm). You can do this intelligently (via a heuristic) or just stupidly: pick them at random. Turns out it doesn't matter. The explanation is roughly that if it were important, and you threw it out, you'd see it again and it would get re-added. The chance that something that occurs a lot keeps getting picked to be thrown out is low. There's some connection to using approximate counting for language modeling, but the Bartlett et al. paper is being even stupider than we were being!

Learning efficiently with approximate inference via dual losses (O. Meshi, D. Sontag, T. Jaakkola, A. Globerson). Usually when you train structured models, you alternate between running inference (a maximization to find the most likely output for a given training instance) and running some optimization (a minimization to move your weight vector around to achieve lower loss). The observation here is that by taking the dual of the inference problem, you turn the maximization into a minimization. You now have a dual minimization, which you can solve simultaneously, meaning that when your weights are still crappy, you aren't wasting time finding perfect outputs. Moreover, you can "warm start" your inference for the next round. It's a very nice idea. I have to confess I was a bit disappointed by the experimental results, though: the gains weren't quite what I was hoping. However, most of the graphs they were using weren't very large, so maybe as yo move toward harder problems, the speed-ups will be more obvious.

Deep learning via Hessian-free optimization (J. Martens). Note that I neither saw this presentation nor read the paper (skimmed it!), but I talked with James about this over lunch one day. The "obvious" take away message is that you should read up on your optimization literature, and start using second order methods instead of your silly gradient methods (and don't store that giant Hessian: use efficient matrix-vector products). But the less obvious take away message is that some of the prevailing attitudes about optimizing deep belief networks may be wrong. For those who don't know, the usual deal is to train the networks layer by layer in an auto-encoder fashion, and then at the end apply back-propogation. The party line that I've already heard is that the layer-wise training is very important to getting the network near a "good" local optimum (whatever that means). But if James' story holds out, this seems to not be true: he doesn't do any clever initialization and still find good local optima!

A theoretical analysis of feature pooling in vision algorithms (Y.-L. Boureau, J. Ponce, Y. LeCun). Yes, that's right: a vision paper. Why should you read this paper? Here's the question they're asking: after you do some blah blah blah feature extraction stuff (specifically: Sift features), you get something that looks like a multiset of features (hrm.... sounds familiar). These are often turned into a histogram (basically taking averages) and sometimes just used as a bag: did I see this feature or not. (Sound familiar yet?) The analysis is: why should one of these be better and, in particular, why (in practice) do vision people see multiple regimes. Y-Lan et al. provide a simple, obviously broken, model (that assumes feature independence... okay, this has to sound familiar now) to look at the discriminability of these features (roughly the ration of between-class variances and overall variances) to see how these regimes work out. And they look basically how they do in practice (modulo one "advanced" model, which doesn't quite work out how they had hoped).

Some other papers that I liked, but don't want to write too much about:

Some papers that other people said they liked were:
Hope to see you at ACL!

07 June 2010

NAACL 2010 Retrospective

I just returned from NAACL 2010, which was simultaneously located in my home town of Los Angeles and located nowhere near my home town of Los Angeles. (That's me trying to deride downtown LA as being nothing like real LA.)

Overall I was pleased with the program. I saw a few talks that changed (a bit) how I think about some problems. There were only one or two talks I saw that made me wonder how "that paper" got in, which I think is an acceptable level. Of course I spend a great deal of time not at talks, but no longer feel bad about doing so.

On tutorials day, I saw Hoifung Poon's tutorial on Markov Logic Networks. I think Hoifung did a great job of targetting the tutorial at just the right audience, which probably wasn't exactly me (though I still quite enjoyed it myself). I won't try to describe MLNs, but my very brief summary is "language for compactly expressing complex factor graphs (or CRFs, if you prefer)." That's not exactly right, but I think it's pretty close. You can check back in a few months and see if there are going to be any upcoming "X, Y and Daume, 2011" papers using MLNs :P. At any rate, I think it's a topic worth knowing about, especially if you really just want to get a system up and running quickly. (I'm also interested in trying Andrew McCallum's Factorie system, which, to some degree, trades easy of use for added functionality. But honestly, I don't really have time to just try things these days: students have to do that for me.)

One of my favorite papers of the conference was one that I hadn't even planned to go see! It is Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables by Tetsuji Nakagawa, Kentaro Inui and Sadao Kurohashi. (I saw it basically because by the end of the conference, I was too lazy to switch rooms after the prvious talk.) There are two things I really like about this paper. The first is that the type of sentiment they're going after is really broad. Example sentences included things that I'd love to look up, but apparently were only in the slides... but definitely more than "I love this movie." The example in the paper is "Tylenol prevents cancer," which is a nice positive case.

The basic idea is that some words give you sentiment. For instance, by itself, "cancer" is probably negative. But then some words flip polarity. Like "prevents." Or negation. Or other things. They set up a model based on sentence level annotations with latent variables for the "polarity" words and for the "flipping" words. The flipping words are allowed to flip any sentiment below them in the dependency tree. Cool idea! Of course, I have to nit-pick the paper a bit. It probably would be better to allow arguments/adjuncts to flip polarity, too. Otherwise, negation (which is usually a leaf) will never flip anything. And adjectives/adverbs can't flip either (eg., going from "happy" to "barely happy"). But overall I liked the paper.

A second thing I learned is that XOR problems do exist in real life, which I had previously questioned. The answer came (pretty much unintentionally) from the paper The viability of web-derived polarity lexicons by Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan and Ryan McDonald. I won't talk much about this paper other than to say that if you have 4 billion web pages, you can get some pretty good sentimenty words, if you're careful to not blindly apply graph propagation. But at the end, they throw a meta classifier on the polarity classification task, whose features include things like (1) how many positive terms are in the text, (2) how many negative terms are in the text, (3) how many negations are in the text. Voila! XOR! (Because negation XORs terms.)

I truly enjoyed Owen Rambow's poster on The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece. If you're ever taken a class in mathematical logic, it is very easy for me to summarize this paper: parse trees (dependency or phrase structure) are your languge, but unless you have a theory of that language (in the model-theoretic sense) then whatever you do is meaningless. In more lay terms: you can always push symbols around, but unless you tie a semantics to those symbols, you're really not doing anything. Take home message: pay attention to the meaning of your symbols!

In the category of "things everyone should know about", there was Painless unsupervised learning with features by Taylor Berg-Kirkpatrick, Alexandre Bouchard Côté, John DeNero and Dan Klein. The idea is that you can replace your multinomails in an HMM (or other graphical model) with little maxent models. Do EM in this for unsuperviesd learning and you can throw in a bunch of extra features. I would have liked to have seen a comparison against naive Bayes with the extra features, but my prior belief is sufficiently strong that I'm willing to believe that it's helpful. The only sucky thing about this training regime is that training maxent models with (tens of) thousands of classes is pretty painful. Perhaps a reduction like tournaments or SECOC would help bring it down to a log factor.

I didn't see the presentation for From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing by Valetin Spitkovsky, Hiyan Alshawi and Dan Jurafsky, but I read it. The idea is that you can do better unsupervised dependency parsing by giving your learner progressively harder examples. I really really really tried to get something like this to work for unsearn, but nothing helped and most things hurn. (I only tried adding progressively longer sentences: other ideas, based on conversations with other folks, include looking at vocabulary size, part of speech (eg., human babies learn words in a particular order), etc.) I'm thrilled it actually works.

Again, I didn't see Discriminative Learning over Constrained Latent Representations by Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, but I learned about the work when I visited UIUC recently (thanks again for the invitation, Dan R.!). This paper does exactly what you would guess from the title: learns good discriminative models when you have complex latent structures that you know something about a priori.

I usually ask people at the end of conferences what papers they liked. Here are some papers that were spoken highly of by my fellow NAACLers. (This list is almost unadulterated: one person actually nominated one of the papers I thought shouldn't have gotten in, so I've left it off the list. Other than that, I think I've included everything that was specifically mentioned to me.)

  1. Optimal Parsing Strategies for Linear Context-Free Rewriting Systems by Daniel Gildea.

  2. Products of Random Latent Variable Grammars by Slav Petrov.

  3. Joint Parsing and Alignment with Weakly Synchronized Grammars by David Burkett, John Blitzer and Dan Klein.

  4. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia by Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee.

  5. Type-Based MCMC by Percy Liang, Michael I. Jordan and Dan Klein.
I think I probably have two high level "complaints" about the program this year. First, I feel like we're seeing more and more "I downloaded blah blah blah data and trained a model using entirely standard features to predict something and it kind of worked" papers. I apologize if I've just described your paper, but these papers really rub me the wrong way. I feel like I just don't learn anything from them: we already know that machine learning works surprisingly well and I don't really need more evidence of that. Now, if my sentence described your paper, but your paper additionally had a really interesting analysis that helps us understand something about language, then you rock! Second, I saw a lot of presentations were speakers were somewhat embarassingly unaware of very prominent very relevant prior work. (In none of these cases was the prior work my own: it was work that's much more famous.) Sometimes the papers were cited (and it was more of a "why didn't you compare against that" issue) but very frequently they were not. Obviously not everyone knows about all papers, but I recognized this even for papers that aren't even close to my area.

Okay, I just ranted, so let's end on a positive note. I'm leaving the conference knowing more than when I went, and I had fun at the same time. Often we complain about the obvious type I errors and not-so-obvious type II errors, but overall I found the program strong. Many thanks to the entire program committee for putting together an on-average very good set of papers, and many thanks to all of you for writing these papers!