26 February 2006

Evaluation Criteria: Correlation versus Generalization

When developing a "third level" loss function (i.e., an automatic metric), one often shows that the rankings or scores produced by the automatic function correlate well with the rankings or scores produced by the human-level function.

I'm not sure this is the best thing to do. The problem is that correlation tells us just that: on the data sample we have, our automatic evaluation function (AE) correlates well with the human evaluation function (HE). But we don't actually care (directly) about the correlation on past data. We care about the ability to make future predictions of quality. Namely, we often want to know:

If I improve my system by 1 point AE, will this improve my HE score?

This is a generalization question, not a correlation question. But we know how to analyze generalization questions: Machine learning theorists do it all the time!
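To make the distinction concrete, here's a minimal sketch with entirely made-up numbers (the data and the linear fit are illustrative assumptions, not anyone's actual metric): a high in-sample correlation between AE and HE, followed by the extrapolation question we actually care about.

```python
# Hypothetical illustration: in-sample correlation vs. out-of-range prediction.
import numpy as np

rng = np.random.default_rng(0)

# AE and HE scores for the systems we happen to have observed ("training data").
ae_observed = rng.uniform(0.2, 0.5, size=20)
he_observed = 0.8 * ae_observed + rng.normal(0, 0.02, size=20)

# In-sample correlation looks great...
corr = np.corrcoef(ae_observed, he_observed)[0, 1]

# ...but the question we care about is predictive: fit HE ~ AE on the observed
# systems, then ask what happens for a system with a much higher AE score.
slope, intercept = np.polyfit(ae_observed, he_observed, 1)
ae_new = 0.7                      # a system well outside the observed range
he_predicted = slope * ae_new + intercept

print(f"in-sample correlation: {corr:.3f}")
print(f"predicted HE at AE={ae_new}: {he_predicted:.3f}  (an extrapolation, not a guarantee)")
```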

So, what do we know? If we want AE to predict HE well in the generalization sense, then it is (with high probability) sufficient that AE predicts HE well on our observed (training) data and that AE is not "too complex." The latter usually holds: AE is often chosen to be as simple as possible, otherwise people probably won't buy it. Usually it's chosen from a small finite set, which means we can do really simple PAC-style analyses. So this should be enough, right? We have good prediction of the training data (the correlation) and a small hypothesis class, so we'll get good generalization. Thus, correlation is enough.
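For the record, here's a back-of-the-envelope sketch of the finite-class argument alluded to above (a standard Hoeffding plus union bound; the specific numbers are made up for illustration):

```python
# Finite hypothesis class generalization bound (Hoeffding + union bound).
import math

def finite_class_bound(num_hypotheses: int, num_samples: int, delta: float) -> float:
    """With probability >= 1 - delta, for every AE in a class of size
    num_hypotheses evaluated on num_samples i.i.d. systems,
    |generalization error - training error| is at most this quantity."""
    return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * num_samples))

# e.g., choosing among 50 candidate metrics, evaluated on 100 i.i.d. systems:
print(finite_class_bound(num_hypotheses=50, num_samples=100, delta=0.05))  # ~0.19
```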

Well, not quite.

Correlation plus a small hypothesis class is enough only when the "training data" is i.i.d. And our problem assumptions kill both of the "i"s. First, our data is not independent. In summarization (and to a perhaps slightly lesser extent in MT), the systems are very, very similar to each other. In fact, when people submit multiple runs, the runs are hugely similar. Second, our data is not identically distributed to the test data. The whole point of this exercise was to answer the question: if I improve my AE, will my HE improve? But, at least if I have the top system, once I improve my AE too much, my system is no longer drawn from the same distribution of systems that generated the training data. So the data is hardly i.i.d., and easy generalization bounds cannot be had.
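The independence failure is easy to see in a tiny simulation (everything here is an assumption for illustration: Gaussian system scores, near-duplicate "runs"): a hundred correlated systems tell us little more than the handful of genuinely distinct ones they were copied from.

```python
# Simulation: many near-duplicate runs do not give us many effective samples.
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_distinct, n_copies = 2000, 5, 20

corrs = []
for _ in range(n_trials):
    # 5 genuinely different systems...
    ae_base = rng.normal(size=n_distinct)
    he_base = ae_base + rng.normal(size=n_distinct)
    # ...each submitted 20 times with tiny perturbations ("multiple runs")
    ae = np.repeat(ae_base, n_copies) + rng.normal(scale=0.01, size=n_distinct * n_copies)
    he = np.repeat(he_base, n_copies) + rng.normal(scale=0.01, size=n_distinct * n_copies)
    corrs.append(np.corrcoef(ae, he)[0, 1])

# Despite "100 data points" per trial, the correlation estimate swings from
# trial to trial about as much as it would with only 5 independent points.
print(f"std of correlation estimates across trials: {np.std(corrs):.2f}")
```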

So what can we do about this? Well, recognizing it is a start. More practically, it might be worth subsampling from systems when deriving correlation results; one rough way to do this is sketched below. The systems should be chosen to be (a) as diverse as possible and (b) as high-performing as possible. The diversity helps with independence. The high performance means that we're not getting superb correlation on crummy systems and no correlation on the great systems, leading to poor generalization as systems get better. This doesn't fix the problem, but it might help. And, of course, whenever possible we should actually run HE.
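Here is one possible sketch of that subsampling idea: greedily pick systems that are both high-performing and dissimilar to those already chosen. The representation of system outputs, the cosine similarity, and the trade-off weight are all placeholder assumptions, not a fixed recipe.

```python
# Greedy selection of diverse, high-performing systems for metric evaluation.
import numpy as np

def select_systems(outputs: np.ndarray, he_scores: np.ndarray, k: int,
                   diversity_weight: float = 1.0) -> list[int]:
    """outputs: (n_systems, n_features) vectors describing each system's output
    (e.g., bag-of-n-gram counts); he_scores: human evaluation scores.
    Returns indices of k systems balancing HE score against redundancy."""
    chosen: list[int] = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(len(he_scores)):
            if i in chosen:
                continue
            if chosen:
                # cosine similarity to the closest already-chosen system
                sims = [np.dot(outputs[i], outputs[j]) /
                        (np.linalg.norm(outputs[i]) * np.linalg.norm(outputs[j]) + 1e-9)
                        for j in chosen]
                redundancy = max(sims)
            else:
                redundancy = 0.0
            val = he_scores[i] - diversity_weight * redundancy
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen
```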
