31 July 2014

Reading group notes: point/counter-point on "predict models"

In our local summer reading group, I led the discussion of two papers that appeared in Baltimore last month:

I love handouts, so I made a handout for this one too. I paste below the handout. All good ideas are those of the respective authors; all errors and bad ideas are probably due to bad transcription on my  part.

Don't count, predict! A systematic comparison of context-counting
vs. context-predicting semantic vectors

Marco Baroni & Georgiana Dinu & Germ\'an Kruszewski

Motivation: these silly deep learning people keep writing papers but
don't compare to traditional distributional semantics models. So we

Conclusion: okay, those people are actually right.

== Background ==

Distributional semantics = you know a word by the company it keeps

"Count models":
 * For each word type, collect context vectors
 * Context vectors look at n words on the left and right
     (with position info? together or separately?)
     varied in 2..5
 * Each type is represented by the bag of contexts in which it appears
 * Contexts are scored by PMI or LLR
     (gets rid of useful but frequent contexts)
 * We might reduce dimensionality to k in { 200, ..., 500 }
    - using either SVD or NNMF

"Predict models" (aka deep learning):
 * Assume a mapping from word type -> k-dim vector
 * Learn a model to predict any word token given the vectors
     of the n words to its left and right
     varied in {2,5}
 * Words are thrown out proportional to their frequency
    - makes things faster
    - reduces importance of frequent words like IDF
 * Vary k in { 200, ..., 500 }

CW (Collobert & Weston) models:
 * freely available online
 * 100 dimensional vectors trained for 2 months on wikipedia
 * predict a word with 5 words to the left and right
 * used extensively in other literature

== Tasks ==
 * Synonym detection from TOEFL
   - given "levied" choose from { imposed, believed, requested, correlated }
   - compute cosine of word representations
 * Concept categorization
   - cluster words like { helicopters, motorcyles} into one class
     and { dogs, elephants } into another
   - used off-the-shelf clustering algorithms
 * Selectional preferences
    - given a verb/noun pair say whether the verb selects for that noun
    - eg, "eat apples" versus "eat gravity"
    - for each verb, take 20 most strongly associated nouns, average
        their representations, and measure cosine similarity to that
 * Analogy
    - eg, brother:sister :: grandson:(?granddaughter)
    - find nearest neighbor of (brother - sister + grandson)

== Summary of results ==
                                     pred    tie   count    (win: >=5)
                                     wins           wins
 * tune parameters PER TASK:          10      4      0
 * tune parameters OVERALL:           10      3      1
 * worst parameters                   14      0      0
 * best from relatedness:             11      3      0

 - best count   model: window=2, PMI, no compression, 300k dimensions
 - best predict model: window=5, no hier softmax, neg sampling, 400 dim

Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy & Yoav Goldberg

Motivation: neural representations capture analogical reasoning well;
why does this happen?

Conclusion: we can just as well using tranditional distributed word
representations if we measure similarity in a "multiplicative" way

== Background ==

* neural language models produce representations that can answer analogy questions:
  - gender...   man:woman :: king:queen
  - speakers... france:french :: mexico:spanish
  - number...   apple:apples :: car:cars
* question: how much of this is a property of *embeddings* (ie dense, low-dimensional)
  - alternative is distributional similarity == bag of contexts (ala Baroni paper)

== Experiment 1 ==

* Mikolov (word2vec) computes similarity for solving a:b :: a*:b* by finding:
    arg max_{b*}  similarity(b*, a* - a + b)       (called 3CosAdd)
   aka similarity(queen, king - man + woman)
* they use cosine similarity: cos(u,v) = dot(u,v) / [ ||u||  ||v|| ]
* expanding out, you get:
    arg max_{b*}  cos(b*,b) + cos(b*,a*) - cos(b*,a)
    cos(queen,woman) + cos(queen,king) - cos(queen,man)

Results: on MSR & Google datasets, embeggings >> explicit (predict >> count)
         on SemEval, basically tied (closed vocabulary?)

* an alternative is:
    arg max_{b*}  similarity(b*-b, a*-a)           (called PairDirection)
  aka similarity(queen-woman, king-man)

Results: Much much worse on MSR, Google (open vocab), and better (though tied) on
SemEval. Perhaps because scale matters for open vocabulary? could have been tested
explicitly... (drat!)

== Experiment 2 ==

Looking at the expansion of 3CosAdd, it looks like a "noisy or" sort of operation:
 - b* should be close to b, should be close to a*, should be far from a...

What about using something more like noisy and:

            cos(b*,b)  cos(b*,a*)
  arg max  -----------------------
      b*       cos(b*,a) + eps


            cos(queen,king)  cos(queen,woman)
  arg max  -----------------------------------
      b*          cos(queen,man) + eps


                      MSR       Google
  3CosAdd  Predict    54%       63%
           Count      29%       45%
  3CosMul  Predict    59%       67%
           Count      57%       68%

One against the other:

  Both correct    -- 54% of cases
  Both wrong      -- 24%
  Predict correct -- 11.1%
  Count correct   -- 11.6%

27 July 2014

Hello, World!

Okay, usually Hello World is the first program you learn to write in a new programming language. For fun, I've been collecting how to say hello world in different human languages, something remarkably difficult to search for (because of the overloading of the word "language"). I have 28. I'd like to make it to 280 :). If you have one (or more) to contribute, email me, post a comment, or tweet to me @haldaume3. And of course if you think any of these is wrong, please let me know that too.

     1 bar Servus Woid!
     2 ca  Hola Món!
     3 de  Hallo Welt!
     4 en  Hello World!
     5 eo  Saluton, Mondo!
     6 es  ¡Hola Mundo!
     7 eu  Kaixo, mundua!
     8 fi  Hei maailma!
     9 hu  Helló, világ!
    10 ia  Hallo, mundo!
    11 id  Halo dunia!
    12 ja  こんにちは世界
    13 lv  Sveika, pasaule!
    14 min Helo dunia!
    15 mk  Здраво свету!
    16 ms  Helo dunia!
    17 nn  Hallo verda!
    18 no  Hallo, verden!
    19 pt  Olá Mundo!
    20 sh  Zdravo svete!
    21 sl  Pozdravljen svet!
    22 sq  Njatjeta Botë!
    23 sr  Здраво свете!
    24 sv  Hej Världen!
    25 th  เฮลโลเวิลด์
    26 tr  Merhaba dünya!
    27 vi  Xin chào thế giới!
    28 zh  世界,你好!

05 July 2014

My ACL 2014 picks...

Usual caveats: didn't see all papers, blah blah blah. Also look for #acl14nlp on twitter -- lots of papers were mentioned there too!

  • A Tabular Method for Dynamic Oracles in Transition-Based Parsing; Yoav Goldberg, Francesco Sartorio, Giorgio Satta.
    Jaokim Nivre, Ryan McDonald and I tried searnifying MaltParser back in 2007 and never got it to work. Perhaps this is because we didn't have dynamic oracles and we thought that a silly approximate oracle would be good enough. Guess not. Yoav, Francesco and Giorgio have a nice technique for efficiently computing the best possible-to-achieve dependency parse given some prefix, possibly incorrect, parse.
  • Joint Incremental Disfluency Detection and Dependency Parsing; Matthew Honnibal, Mark Johnson
    The basic idea is to do shift-reduce dependency parsing, but allow for "rewinds" in the case of (predicted) disfluencies. I like that they didn't just go with the most obvious model and actually thought about how might be a good way to solve this problem. Basic idea is if you get "Please book a flight to Boston uh to Denver..." is that you parse "to Boston" like usual but then when you get to the "uh", you remove old arcs. You do it this way because detecting the disfluent segment ("to Boston") is much easier when you hit "uh" than when you hit "to Boston."
  • Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors; Marco Baroni; Georgiana Dinu; Germán Kruszewski
    This paper is summarized best by its own statement, which should win it the award for most honest paper ever: "...we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding [neural network embeddings], despite the almost complete lack of a proper comparison.... Our secret wish was to discover that it is all hype... Instead, we found that the [embeddings] are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."
  • Learning to Automatically Solve Algebra Word Problems ; Nate Kushman; Luke Zettlemoyer; Regina Barzilay; Yoav Artzi
    An algebra word problem is something like "I have twice as many dimes as nickles and have $2.53. How many nickles do I have." Of course usually they actually have an answer. They have a nice, fairly linguistically unstructured (i.e., no CCG) for mapping word problems to algebraic formulae and then solving those formulae. Code/data available.
  • Grounded Compositional Semantics for Finding and Describing Images with Sentences; Richard Socher, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng
    This is the follow-on work from Richard's NIPS workshop paper on text <-> images from this past NIPS. They fixed the main bug in that paper (the use of l2 error, which gives a trivial and uninteresting global optimal solution) and get nice results. If you're in the langvis space, worth a read, even if you don't like neural networks :).
  • From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions; Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier
    I really like the "visual denotations" idea here. Basically you say something like "the set of worlds in which this sentence is true is the set of images in which this sentence is true (i.e., roughly the sentence is entailed by the image)." You can then measure similarity between sentences based on denotations.
  • Kneser-Ney Smoothing on Expected Counts; Hui Zhang; David Chiang
    I didn't actually see this talk or read the paper, but lots of people told me in hallways that this is a very nice result. Basically we like KN smoothing, but it only works for integral counts, which means it's hard to incorporate into something like EM, which produces fractional counts. This paper solves this problem.
  • Linguistic Structured Sparsity in Text Categorization; Dani Yogatama; Noah A. Smith
    Also didn't see this one, but skimmed the paper. The reason I really like this paper is because they took a well known technique in ML land (structured sparsity) and applied it to NLP, but in an interesting way. I.e., it wasn't just "apply X to Y" but rather find a very linguistically clever/interesting way of mapping X to a problem that we care about. Very cool work.
Overall I really liked the conference, thanks to everyone who helped put it together. I can't help but notice that about half of my picks above were actually TACL papers. I suspect this will be more and more true over time.

Please add comments with your favorite papers that I missed!