DMVCCM.org.~8~

   1 # -*- coding: mule-utf-8-unix -*-
   2 #+STARTUP: overview
   3 #+TAGS: OPTIMIZE PRETTIER
   4 #+STARTUP: hidestars
   5 #+TITLE: DMV/CCM
   6 #+AUTHOR: Kevin Brubeck Unhammer
   7 #+EMAIL: K.BrubeckUnhammer at student uva nl
   8 #+LANGUAGE: en
   9 #+SEQ_TODO: TOGROK TODO DONE
  10 #+OPTIONS: H:4 toc:3
  11
  12 * dmvccm
  13   DEADLINE: <2008-06-30 Mon>
  14 (But absolute, extended, really-quite-dead-now deadline: August 31...)
  15 [[file:src/dmv.py][dmv.py]]
  16 [[file:src/io.py][io.py]]
  17 ** TODO Adjacency and combining it with inner()
  18 Each DMV_Rule now has both a probN and a probA, for
  19 adjacencies. inner() needs the correct one in each case.
  20
  21 Adjacency gives a problem with duplicate words/tags, eg. in the
  22 sentence "a a b". If this has the dependency structure b->a_0->a_1,
  23 then b is non-adjacent to a_0 and should use probN (for the LRStop and
  24 the attachment of a_0), while the other rules should all use
  25 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
  26 so it's not adjacent to 2", since there's also an a at index 1, and
  27 there's also a dependency structure b->a_1->a_0 for that. We want
  28 both. And in possibly much more complex versions.
  29
  30 Ideas:
  31 - I first thought of decorating the individual words/tags in a
  32   sentence with their indices, and perhaps just duplicating the
  33   relevant rules (one for each index of the duplicate tags). But this
  34   gives an explosion in attachment rules (although a contained
  35   explosion, within the rules used in a sentence; but most sentences
  36   will have at least two NN's so it will be a problem).
  37 - Then, I had a /brilliant/ idea. Just let e(), the helper function of
  38   inner(), pass up an extra pair of boolean values for whether or not
  39   we've attached anything to the left or right yet. So now, e() should
  40   return a list of [prob, Lattach, Rattach]. Whether or not we've
  41   attached anything is possible to see from the local rule and whether
  42   there is attachment below.
  43 - Althought, this idea needed some refinement...
  44 *** TODO: implement my brilliant idea.
  45 [[file:src/io.py::def%20e%20s%20t%20LHS][e(sti) in io.py]]
  46
  47 *** DONE [#A] test inner() on sentences with duplicate words
  48 Works with eg. the sentence "h h h"
  49
  50
  51 ** TODO [#A] P_STOP for IO/EM
  52 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
  53 Remember: The P_STOP formula is upside-down (left-to-right also).
  54 (In the article..not the thesis)
  55 *** How is the P_STOP formula different given other values for dir and adj?
  56 (Presumably, the P_STOP formula where STOP is True is just the
  57 rule-probability of _h_ -> STOP h_ or h_ -> h STOP, but how does
  58 adjacency fit in here?)
  59
  60 (And P_STOP(-STOP|...) = 1 - P_STOP(STOP|...) )
  61 ** TODO P_CHOOSE for IO/EM
  62 Write the formulas! should be easy?
  63 ** Initialization
  64 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
  65
  66 We do have to go through the corpus, since the probabilities are based
  67 on how far away in the sentence arguments are from their heads.
  68 *** TODO Separate initialization to another file?                     :PRETTIER:
  69 (It's rather messy.)
  70 *** TOGROK CCM Initialization
  71 P_SPLIT used here... how, again?
  72 *** DONE DMV Initialization probabilities
  73 (from initialization frequency)
  74 *** DONE DMV Initialization frequencies
  75     CLOSED: [2008-05-27 Tue 20:04]
  76 **** P_STOP
  77 P_STOP is not well defined by K&M. One possible interpretation given
  78 the sentence [det nn vb nn] is
  79 - f_STOP( STOP|det, L, adj) +1
  80 - f_STOP(-STOP|det, L, adj) +0
  81 - f_STOP( STOP|det, L, non_adj) +1
  82 - f_STOP(-STOP|det, L, non_adj) +0
  83 - f_STOP( STOP|det, R, adj) +0
  84 - f_STOP(-STOP|det, R, adj) +1
  85
  86 - f_STOP( STOP|nn, L, adj) +0
  87 - f_STOP(-STOP|nn, L, adj) +1
  88 - f_STOP( STOP|nn, L, non_adj) +1  # since there's at least one to the left
  89 - f_STOP(-STOP|nn, L, non_adj) +0
  90
  91 **** P_CHOOSE
  92 Go through the corpus, counting distances between heads and
  93 arguments. In [det nn vb nn], we give
  94 - f_CHOOSE(nn|det, R) +1/1 + C
  95 - f_CHOOSE(vb|det, R) +1/2 + C
  96 - f_CHOOSE(nn|det, R) +1/3 + C
  97   - If this were the full corpus, P_CHOOSE(nn|det, R) would have
  98     (1+1/3+2C) / sum_a f_CHOOSE(a|det, R)
  99
 100 The ROOT gets "each argument with equal probability", so in a sentence
 101 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 102 just a frequency count of the corpus...
 103 ** [#C] Deferred
 104 *** TOGROK Some (tagged) sentences are bound to come twice            :OPTIMIZE:
 105 Eg, first sort and count, so that the corpus
 106 [['nn','vbd','det','nn'],
 107  ['vbd','nn','det','nn'],
 108  ['nn','vbd','det','nn']]
 109 becomes
 110 [(['nn','vbd','det','nn'],2),
 111  (['vbd','nn','det','nn'],1)]
 112 and then in each loop through sentences, make sure we handle the
 113 frequency correctly.
 114
 115 Is there much to gain here?
 116
 117 *** TOGROK tags as numbers or tags as strings?                        :OPTIMIZE:
 118 Need to clean up the representation.
 119
 120 Stick with tag-strings in initialization then switch to numbers for
 121 IO-algorithm perhaps? Can probably afford more string-matching in
 122 initialization..
 123 *** TOGROK Is the E-step in DMV just inner() on the full sentence?
 124 and What exactly is the M-step of DMV?
 125 *** COMMENT
 126 E: «For the initial symbols, I think we're supposed to
 127 implement someone-or-other's POS tagger and use the parts of speech as
 128 our tags. (K&M say they used parts of speech, anyway, and we can't use
 129 the ones from the annotated corpus or else we won't be unsupervised
 130 anymore.) I think you're right about how the tree is formed. The only
 131 thing I'm not sure about is whether those left and right marks are
 132 supposed to be part of the tree. I guess not because in a well-formed
 133 tree, every word needs a stop on both sides, so actually all nodes
 134 would be left and right marked in the end, which is the same as not
 135 using them at all. So I guess they're relevant in the E step of EM,
 136 but not the M step.» [[https://mail.google.com/mail/%3Ffs%3D1&tf%3D1&source%3Datom&view%3Dcv&search%3Dall&th%3D119b3e3d7d8f1d1f&shva%3D1&ui%3D1&disablechatbrowsercheck%3D1][gmail]]
 137
 138 K: About initialization, when they say they start with an M-step and get
 139 a harmonic distribution (where close dependencies have higher
 140 probabilities than far-away ones), I guess we could do this either in
 141 a function of its own, or as a special case of the IO-algorithm -- I'm
 142 not sure how different the harmonizing step will be from the regular
 143 M-step. But I'm pretty sure harmonizing means having to go through the
 144 whole corpus, since that's the only way we can find out distances for
 145 each dependency (without having actual distance-numbers in the rules
 146 in the first place). If sentence s is ABCD, rules from A to B should
 147 have higher probability than rules from A to D. Assuming we find some
 148 average distance between A and B, and A and D, etc. throughout the
 149 corpus, we then have to turn those numbers into probabilities that sum
 150 to 1; do you have any idea how to do that?
 151
 152 E: I think what it means to start with an M step is that we set the
 153 probabilities without regard for the initial corpus. They say that
 154 they start with an initial guess at "posterior distributions". I bet
 155 the word "posterior" is important, but I don't know what it means
 156 here. Do you?
 157 *** TODO Corpus access
 158 *** TOGROK sentences or rules as the "outer loop"?                    :OPTIMIZE:
 159 In regard to the E/M-step, finding P_STOP, P_CHOOSE.
 160 ** TOGROK How do we know whether we are 'adjacent' or not?
 161 **** One configuration that I'm fairly certain of: right w/CHOOSE
 162 if we have
 163 \Tree [_b [_b b _c_ ] _d_ ]
 164 then the lower tree [_b b _c_ ] is adjacent since, working your way up
 165 the tree, no argument has been created to the right "yet"; while the
 166 outer tree [_b [_b ... ] _d_ ] is non-adjacent, since there is something in
 167 between... Is it thus always adjacent to the right if the distance
 168 is 2? (That is, in e(s,t,i) for the adjacent rule: t - s == 2; while
 169 in the non_adj rule: t - s == 4)
 170 ***** Implementing this:
 171 Two different DMVRules? Or just two different prob-values per rule?
 172 **** left w/CHOOSE
 173 Same deal here?
 174 **** R/L without CHOOSE, the "sealing operations"
 175 _h_ -> STOP h_ and h_ -> h STOP
 176
 177 What is "adjacency" here? That t - s == 1?
 178
 179
 180
 181
 182 * Python-stuff
 183 - [[file:src/pseudo.py][pseudo.py]]
 184 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 185 - http://nltk.org/doc/en/advanced-parsing.html
 186 - http://jaynes.colorado.edu/PythonIdioms.html
 187
 188