14 May 2016

A bad optimizer is not a good thing

A very popular style of research in NLP and ML is the math abstraction. You cast your learning problem as some sort of objective function that you want to optimize. Or, if you're feeling Bayesian, you write down a joint likelihood that you'll either sample from or, yes, turn into an objective function that you want to optimize. The optimizer is then typically considered a black box, aside from its hyperparameters which you often must tune.

This is a very attractive style of research and one that I've personally gotten a lot of leverage out of. (Though it's worth emphasizing, for reviewer #2 out there, that this is not the only way to solve problems, and that not having an explicit objective function that an optimizer is cranking on does not mean your problem formulation is bad.) The main advantage here is separation of concerns. I can worry about making a good objective function, and my optimization friends can worry about optimizing it. Of course it's never that simple, and we must consider what can possibly be optimized when making our objective functions (see, for instance, the obsession with convexity 10 years ago and the anti-obsession with convexity today).

At the end of the day, like any abstraction, it's not perfect, and we need to break it occasionally.

There is, however, something that's struck me as peculiar going back to the heyday of topic models 10 or so years ago, but which applies equally well now. This is a strange world in which we actually hope that our optimizer doesn't work (or s/optimizer/sampler/). To me, this seems like a bad thing to hope for. I think we need to acknowledge that for hard problems, our optimizers and samplers probably won't work, and we need to take that into account. But I don't think we should actively hope that they don't work: that just seems broken.

There are two places where I see this hope arise.

1. Occasionally you see objective functions that have trivial solutions that you're hoping won't be found. I've seen this several times in the context of multimodal learning. Suppose that you want to learn some sort of joint embedding of English words and Japanese words. I can write down some embedding functions f(e) and g(j) using whatever neural net magic I want. The outputs are vectors. Now, I want to have them supervise each other, and so I make an objective function that looks like ||A f(e) - B g(j)||^2 for some matrices A and B. The joint objective is to learn A, B, and all the parameters inside f and g. This seems reasonable, until you realize that simply setting A=B=0 provides not just a local but a global minimum of this objective function. Of course since these things are non-convex, we often initialize randomly. And in most cases, this global minimum actually isn't found (otherwise such approaches would never work). But this is just a broken way to formulate a problem because I'm relying on the fact that my optimizer doesn't actually work. (There's a tiny numeric sketch of this trivial solution just after this list.)

2. Much more commonly, the hope of a broken optimizer comes from the heuristic of using "initialization" to build a better model. The general scheme here is that I have some prior knowledge that I want to inject into the system, and I do so by initializing my model's parameters informed by that knowledge. I then optimize. But again, here, we're hoping that the optimizer fails! If the optimizer were perfect, it should be insensitive to the initialization (modulo picking one of multiple global optima, but I don't think that's what's really going on here). You see the same thing in Bayesian land, wherein you basically hope that your sampler doesn't mix properly. Again, this is a case where we're relying on our optimizer (or sampler) not actually working.
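
Returning to the first failure mode: here is a tiny numeric sketch of the trivial solution (numpy, with made-up dimensions, and random feature maps standing in for the neural nets f and g; nothing here comes from a real system). The objective is strictly positive at a random initialization, but A = B = 0 drives it to its global minimum of exactly zero, for any data whatsoever.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for the embedding functions f(e) and g(j); in practice these
# would be neural nets, here they are just fixed random feature maps
Wf = rng.standard_normal((100, 50))
Wg = rng.standard_normal((80, 50))

def f(e):              # "English"-side embedding -> 50 dims
    return np.tanh(e @ Wf)

def g(j):              # "Japanese"-side embedding -> 50 dims
    return np.tanh(j @ Wg)

e_batch = rng.standard_normal((32, 100))   # toy English inputs
j_batch = rng.standard_normal((32, 80))    # toy Japanese inputs

def objective(A, B):
    # sum over the batch of ||A f(e) - B g(j)||^2
    diff = f(e_batch) @ A.T - g(j_batch) @ B.T
    return float(np.sum(diff ** 2))

A0 = rng.standard_normal((20, 50))         # random init: objective is strictly positive
B0 = rng.standard_normal((20, 50))
print(objective(A0, B0))                   # some large positive number

# setting A = B = 0 hits the global minimum of 0, for any data and any f, g
print(objective(np.zeros((20, 50)), np.zeros((20, 50))))   # exactly 0.0
```

The sane fix is to change the objective so that the collapsed solution is no longer a minimum (constrain A and B, normalize the embeddings, or add a contrastive term with negative pairs), rather than to hope the optimizer never stumbles into it.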

I think it's fair to say that #1 is pretty broken. There are, however, a number of possible responses to #2 that are worth discussing briefly.

Response 1: Early stopping. We rarely run optimizers (or samplers) to completion and instead stop them early based on some held-out measure of generalization. For many optimizers, it's possible to show that early stopping is equivalent to regularization (I first saw this as a homework exercise in the old Bishop neural nets book, but it's been reinvented many times since then... and probably before then). You can alternatively pull out some online learning theory and argue that online updates are implicitly regularizing (toward zero). We can therefore conclude that initialization plus early stopping is essentially a form of regularization toward the initialization point. This is reasonable, but it would be nice to actually lay out the theory more solidly, in a way that's not specific to a particular (model, regularizer, optimizer) triple. (Maybe this has been done but I haven't seen it.)
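
Here is a hedged toy illustration of that point (plain gradient descent on least squares, with made-up data, step size, and stopping time; the lam ~ 1/(2*eta*T) correspondence below is only a heuristic for this quadratic case, not the general theory I'm asking for). Run to convergence, the initialization is forgotten entirely; stopped early, the iterate looks a lot like the solution of a problem regularized toward the initializer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

eta, T = 0.01, 20                       # step size and (early) stopping time

def gd(w, steps):
    # plain gradient descent on the least-squares loss (1/n) * ||Xw - y||^2
    for _ in range(steps):
        w = w - eta * (2.0 / n) * X.T @ (X @ w - y)
    return w

w0_a = 5.0 * rng.standard_normal(d)     # two different "informed" initializations
w0_b = 5.0 * rng.standard_normal(d)

# run (essentially) to convergence: the initialization is completely washed out
print(np.linalg.norm(gd(w0_a, 50000) - gd(w0_b, 50000)))   # ~0

# stop early: the two solutions still remember where they started
print(np.linalg.norm(gd(w0_a, T) - gd(w0_b, T)))            # clearly nonzero

# heuristically, T steps of size eta act roughly like regularizing toward the
# initializer with strength lam ~ 1/(2 * eta * T) (for this quadratic loss);
# the early-stopped iterate sits near the "ridge toward w0_a" solution, while
# the fully-optimized one does not
lam = 1.0 / (2 * eta * T)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n + lam * w0_a)
print(np.linalg.norm(gd(w0_a, T) - w_ridge))       # relatively small
print(np.linalg.norm(gd(w0_a, 50000) - w_ridge))   # much larger
```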

Response 2: Initialization isn't to help the model, it's to help the optimizer. This is a pretty solid response. I think that when we do initialization, this is probably how we should think about it. We think the model is perfect (or at least good) but that the optimizer is going to have a hard time, so we start it somewhere reasonable. This perspective is hard to reconcile with the "initialization as prior knowledge" view (it's possible, but requires some mental acrobatics), but it's a reasonable one.

So what should we be doing? Well, if someone wants to prove something along the lines of Response 1 (a general equivalence between initialization plus early stopping and regularization toward the initializer), that would be nice :). Or alternatively just point me to an existing result!

More generally, though, I think the model vs optimization abstraction is quite useful. If we want to incorporate prior knowledge into the model, then some sort of regularization (or prior) seems like a very reasonable approach. (Yes, there are others.) If we want to try to help out our optimizer, then initialization may be a reasonable approach, but it really depends on the optimizer. One can of course do both: the "regularize toward X" approach together with initialization at (or near) X. (Though in my experience, regularizing toward X often means that initializing at X isn't necessary, because the regularizer already introduces a good bias and breaks symmetry.)
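
For concreteness, here is a minimal sketch of that last combination on a toy least-squares problem (numpy; w_prior plays the role of X, and w_prior, lam, and the step size are all made up). The "regularize toward X" idea is just an extra penalty term in the objective, and initializing at X is a separate one-line choice; the bias toward X then survives even if the optimizer runs forever.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 10
Phi = rng.standard_normal((n, d))            # design matrix for a toy regression
y = Phi @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w_prior = rng.standard_normal(d)             # plays the role of "X": the point we trust
lam = 0.5                                    # how hard we pull toward it

def grad(w):
    # gradient of (1/n) * ||Phi w - y||^2 + lam * ||w - w_prior||^2,
    # i.e. the data term plus the "regularize toward w_prior" term
    return (2.0 / n) * Phi.T @ (Phi @ w - y) + 2.0 * lam * (w - w_prior)

w = w_prior.copy()                           # optionally also initialize at (or near) w_prior
for _ in range(5000):                        # run long enough to actually converge
    w = w - 0.05 * grad(w)

# the penalized solution ends up closer to w_prior than the unregularized one,
# and stays that way no matter how long we optimize: the bias toward the prior
# knowledge lives in the objective, not in a conveniently broken optimizer
w_unreg = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(np.linalg.norm(w - w_prior), np.linalg.norm(w_unreg - w_prior))
```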