There was a talk in the session I chaired at ACL about directly optimizing CRFs to produce low F-scores for problems like NE tagging and chunking. The technique is fairly clever and is based on the observation that you can use very similar dynamic programming techniques to do max F-score as to do max log-probability in CRFs.The details are not particularly important, but during the question phase, Chris Manning asked the following question: Given that F-score is not really motivated (i.e., is a type 4 loss), should we really be trying to optimize it? Conditional probability seems like a completely reasonable thing to want to maximize, given that we don't know how the tags will be used down the pipeline. (It seems Chris is also somewhat biased by a paper he subsequently had at EMNLP talking about sampling in pipelines.)
I think Chris' point is well taken. Absent any other information, conditional probably seems like a quite plausible thing to want to optimize, since given the true conditional probabilities, we can plug in any loss function at test time and do minimum Bayes risk (in theory, at least).
On the other hand, there is an interesting subtlety here. Conditional probability of what? The standard CRF optimization tries to maximize conditional probability of the entire sequence. The "alternative objective" of Kakade, Teh and Roweis optimizes the sub of the conditional probabilities of each label. These are two quite different criterea, and which one should we choose? In fact, neither really seems appropriate. Conditional probability of the sequence doesn't make sense because it would rather improve a bad label from probability 0.01 to 0.1 than improve a bad label from 0.4 to 0.6 and thus get it right. But summed conditional probability of labels doesn't make sense in NE tagging tasks because always assigning probability 0.9 to "not an entity" will do quite well. This is essentially the "accuracy versus f-score" problem, where, when few elements are actually "on," accuracy is a pretty terrible metric.
If we take Chris' advice and desire a conditional probability, it seems what we really want is direct conditional probability over the chunks! But how do we formulate this and how do we optimize it? My impression is that a direct modification of the paper Chris was asking about would actually enable us to do exactly that. So, while the authors of this paper were focusing on optimizing F-score, I think they've also given us a way to optimize conditional chunk probabilities (actually this should be easier than F-score because there are fewer forward/backward dependencies), similar to what Kakade et al. did for conditional label probabilities.