DMVCCM.org~20080528~

   1 # -*- coding: mule-utf-8-unix -*-
   2 #+STARTUP: overview
   3 #+TAGS: OPTIMIZE
   4 #+STARTUP: hidestars
   5 #+TITLE: DMV/CCM
   6 #+AUTHOR: Kevin Brubeck Unhammer
   7 #+EMAIL: K.BrubeckUnhammer at student uva nl
   8 #+LANGUAGE: en
   9 #+SEQ_TODO: TOGROK TODO DONE
  10 #+OPTIONS: H:4 toc:3
  11
  12 * dmvccm
  13   DEADLINE: <2008-06-30 Mon>
  14 (But absolute, extended, really-quite-dead-now deadline: August 31...)
  15 [[file:src/dmv.py][dmv.py]]
  16 [[file:src/io.py][io.py]]
  17 ** TODO [#A] P_STOP
  18 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
  19 *** Remember: The P_STOP formula is upside-down (left-to-right also)
  20 *** How is the P_STOP formula different given other values for dir and adj?
  21 (Presumably, the P_STOP formula where STOP is True is just the
  22 rule-probability of _h_ -> STOP h_ or h_ -> h STOP, but how does
  23 adjacency fit in here?)
  24
  25 (And P_STOP(-STOP|...) = 1 - P_STOP(STOP|...) )
  26 ** TODO P_CHOOSE
  27 Write the formulas! should be easy?
  28 ** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
  29 Eg, first sort and count, so that the corpus
  30 [['nn','vbd','det','nn'],
  31  ['vbd','nn','det','nn'],
  32  ['nn','vbd','det','nn']]
  33 becomes
  34 [(['nn','vbd','det','nn'],2),
  35  (['vbd','nn','det','nn'],1)]
  36 and then in each loop through sentences, make sure we handle the
  37 frequency correctly.
  38
  39 Is there much to gain here?
  40 ** Initialization
  41 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
  42 we Do have to go through the corpus, since the probabilities are based
  43 on how far away in the sentence arguments are from their heads.
  44 *** DONE DMV Initialization frequencies
  45     CLOSED: [2008-05-27 Tue 20:04]
  46 **** P_STOP
  47 P_STOP is not well defined by K&M. One possible interpretation given
  48 the sentence [det nn vb nn] is
  49 - f_STOP( STOP|det, L, adj) +1
  50 - f_STOP(-STOP|det, L, adj) +0
  51 - f_STOP( STOP|det, L, non_adj) +1
  52 - f_STOP(-STOP|det, L, non_adj) +0
  53 - f_STOP( STOP|det, R, adj) +0
  54 - f_STOP(-STOP|det, R, adj) +1
  55
  56 - f_STOP( STOP|nn, L, adj) +0
  57 - f_STOP(-STOP|nn, L, adj) +1
  58 - f_STOP( STOP|nn, L, non_adj) +1  # since there's at least one to the left
  59 - f_STOP(-STOP|nn, L, non_adj) +0
  60
  61 **** P_CHOOSE
  62 Go through the corpus, counting distances between heads and
  63 arguments. In [det nn vb nn], we give
  64 - f_CHOOSE(nn|det, R) +1/1 + C
  65 - f_CHOOSE(vb|det, R) +1/2 + C
  66 - f_CHOOSE(nn|det, R) +1/3 + C
  67   - If this were the full corpus, P_CHOOSE(nn|det, R) would have
  68     (1+1/3+2C) / sum_a f_CHOOSE(a|det, R)
  69
  70 The ROOT gets "each argument with equal probability", so in a sentence
  71 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
  72 just a frequency count of the corpus...
  73 *** TOGROK DMV Initialization probabilities from initialization frequency
  74 *** TOGROK CCM Initialization
  75 P_SPLIT used here... how, again?
  76 ** TOGROK Adjacency and combining it with inner()
  77 ** TOGROK Is the E-step in DMV just inner() on the full sentence?
  78 ** TOGROK What exactly is the M-step of DMV?
  79 *** COMMENT
  80 E: «For the initial symbols, I think we're supposed to
  81 implement someone-or-other's POS tagger and use the parts of speech as
  82 our tags. (K&M say they used parts of speech, anyway, and we can't use
  83 the ones from the annotated corpus or else we won't be unsupervised
  84 anymore.) I think you're right about how the tree is formed. The only
  85 thing I'm not sure about is whether those left and right marks are
  86 supposed to be part of the tree. I guess not because in a well-formed
  87 tree, every word needs a stop on both sides, so actually all nodes
  88 would be left and right marked in the end, which is the same as not
  89 using them at all. So I guess they're relevant in the E step of EM,
  90 but not the M step.» [[https://mail.google.com/mail/%3Ffs%3D1&tf%3D1&source%3Datom&view%3Dcv&search%3Dall&th%3D119b3e3d7d8f1d1f&shva%3D1&ui%3D1&disablechatbrowsercheck%3D1][gmail]]
  91
  92 K: About initialization, when they say they start with an M-step and get
  93 a harmonic distribution (where close dependencies have higher
  94 probabilities than far-away ones), I guess we could do this either in
  95 a function of its own, or as a special case of the IO-algorithm -- I'm
  96 not sure how different the harmonizing step will be from the regular
  97 M-step. But I'm pretty sure harmonizing means having to go through the
  98 whole corpus, since that's the only way we can find out distances for
  99 each dependency (without having actual distance-numbers in the rules
 100 in the first place). If sentence s is ABCD, rules from A to B should
 101 have higher probability than rules from A to D. Assuming we find some
 102 average distance between A and B, and A and D, etc. throughout the
 103 corpus, we then have to turn those numbers into probabilities that sum
 104 to 1; do you have any idea how to do that?
 105
 106 E: I think what it means to start with an M step is that we set the
 107 probabilities without regard for the initial corpus. They say that
 108 they start with an initial guess at "posterior distributions". I bet
 109 the word "posterior" is important, but I don't know what it means
 110 here. Do you?
 111 ** TODO Corpus access
 112 ** TOGROK How do we interpret DMV as an inside/outside process?
 113 c_s(x : i, j) is "the expected fraction of parses of s" with x from
 114 i to j; expectation then uses the probabilities gotten from
 115 initialization and previously gained probabilities, but these are of
 116 the form P_STOP and P_CHOOSE, how do we translate this to inside
 117 outside, which just uses the probabilities of CFG-rules?
 118
 119 ** Technical: sentences or rules as the "outer loop"?
 120 ** CCM-stuff
 121 ** DONE How do we know whether we are 'adjacent' or not?
 122 **** One configuration that I'm fairly certain of: right w/CHOOSE
 123 if we have
 124 \Tree [_b [_b b _c_ ] _d_ ]
 125 then the lower tree [_b b _c_ ] is adjacent since, working your way up
 126 the tree, no argument has been created to the right "yet"; while the
 127 outer tree [_b [_b ... ] _d_ ] is non-adjacent, since there is something in
 128 between... Is it thus always adjacent to the right if the distance
 129 is 2? (That is, in e(s,t,i) for the adjacent rule: t - s == 2; while
 130 in the non_adj rule: t - s == 4)
 131 ***** Implementing this:
 132 Two different DMVRules? Or just two different prob-values per rule?
 133 **** left w/CHOOSE
 134 Same deal here?
 135 **** R/L without CHOOSE, the "sealing operations"
 136 _h_ -> STOP h_ and h_ -> h STOP
 137
 138 What is "adjacency" here? That t - s == 1?
 139
 140
 141
 142
 143 * Python-stuff
 144 - [[file:src/pseudo.py][pseudo.py]]
 145 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 146 - http://nltk.org/doc/en/advanced-parsing.html
 147
 148