DMVCCM.html.~20~

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/03 16:12:36"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM</h1>
  14 <div id="table-of-contents">
  15 <h2>Table of Contents</h2>
  16 <ul>
  17 <li><a href="#sec-1">1 dmvccm</a></li>
  18 <li><a href="#sec-2">2 Python-stuff</a></li>
  19 </ul>
  20 </div>
  21
  22 <div class="outline-2">
  23 <h2 id="sec-1">1 dmvccm</h2>
  24
  25 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  26 (But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
  27 <a href="src/dmv.py">dmv.py</a>
  28 <a href="src/io.py">io.py</a>
  29 </p><ul>
  30 <li><span class="todo">TODO</span> Adjacency and combining it with inner()<br/>
  31 Each DMV_Rule now has both a probN and a probA, for
  32 adjacencies. inner() needs the correct one in each case.
  33
  34 <p>
  35 Adjacency gives a problem with duplicate words/tags, eg. in the
  36 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
  37 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
  38 the attachment of a<sub>0</sub>), while the other rules should all use
  39 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
  40 so it's not adjacent to 2", since there's also an a at index 1, and
  41 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
  42 both. And in possibly much more complex versions.
  43 </p>
  44 <p>
  45 Ideas:
  46 </p><ul>
  47 <li>
  48 I first thought of decorating the individual words/tags in a
  49 sentence with their indices, and perhaps just duplicating the
  50 relevant rules (one for each index of the duplicate tags). But this
  51 gives an explosion in attachment rules (although a contained
  52 explosion, within the rules used in a sentence; but most sentences
  53 will have at least two NN's so it will be a problem).
  54 </li>
  55 <li>
  56 Then, I had a <i>brilliant</i> idea. Just let e(), the helper function of
  57 inner(), parametrize for an extra pair of boolean values for whether
  58 or not we've attached anything to the left or right yet ("yet"
  59 meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  60 Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
  61 possible values for (Lattach,Rattach). This makes e() lots more
  62 complex and DMV-specific though, so it's been rewritten in
  63 inner_dmv() in dmv.py.
  64 </li>
  65 <li><span class="todo">TODO</span> document this adjacency stuff better<br/>
  66 </li>
  67 <li><span class="todo">TODO</span> test and debug my brilliant idea<br/>
  68 </li>
  69 <li><span class="done">DONE</span> implement my brilliant idea.<br/>
  70 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
  71 <a href="src/dmv.py">e(sti) in dmv.py</a>
  72
  73 </li>
  74 <li><span class="done">DONE</span> [#A] test inner() on sentences with duplicate words<br/>
  75 Works with eg. the sentence "h h h"
  76
  77
  78 </li>
  79 </ul>
  80 </li>
  81 <li><span class="todo">TODO</span> [#A] P_STOP for IO/EM<br/>
  82 <a href="src/dmv.py">dmv-P_STOP</a>
  83 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
  84 (In the article..not the thesis)
  85
  86 <p>
  87 Remember: Initialization makes some "short-cut" rules, these will also
  88 have to be updated along with the other P<sub>STOP</sub> updates:
  89 </p><ul>
  90 <li>
  91 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0       # always
  92 </li>
  93 <li>
  94 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA  # h_ is RBAR stop rule
  95 </li>
  96 <li>
  97 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _ h_.probA
  98
  99 </li>
 100 <li>How is the P_STOP formula different given other values for dir and adj?<br/>
 101 (Presumably, the P<sub>STOP</sub> formula where STOP is True is just the
 102 rule-probability of _ h_ -&gt; STOP h_ or h_ -&gt; h STOP, but how does
 103 adjacency fit in here?)
 104
 105 <p>
 106 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 107 </p></li>
 108 </ul>
 109 </li>
 110 <li><span class="todo">TODO</span> P_CHOOSE for IO/EM<br/>
 111 Write the formulas! should be easy?
 112 </li>
 113 <li>Initialization   <br/>
 114 <a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 115
 116 <p>
 117 We do have to go through the corpus, since the probabilities are based
 118 on how far away in the sentence arguments are from their heads.
 119 </p><ul>
 120 <li><span class="todo">TODO</span> Separate initialization to another file?                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 121 (It's rather messy.)
 122 </li>
 123 <li><span class="todo">TOGROK</span> CCM Initialization    <br/>
 124 P<sub>SPLIT</sub> used here&hellip; how, again?
 125 </li>
 126 <li><span class="done">DONE</span> DMV Initialization probabilities<br/>
 127 (from initialization frequency)
 128 </li>
 129 <li><span class="done">DONE</span> DMV Initialization frequencies    <br/>
 130 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 131 <ul>
 132 <li>P_STOP    <br/>
 133 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 134 the sentence [det nn vb nn] is
 135 <pre>
 136  f_{STOP}( STOP|det, L, adj) +1
 137  f_{STOP}(-STOP|det, L, adj) +0
 138  f_{STOP}( STOP|det, L, non_adj) +1
 139  f_{STOP}(-STOP|det, L, non_adj) +0
 140  f_{STOP}( STOP|det, R, adj) +0
 141  f_{STOP}(-STOP|det, R, adj) +1
 142
 143  f_{STOP}( STOP|nn, L, adj) +0
 144  f_{STOP}(-STOP|nn, L, adj) +1
 145  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 146  f_{STOP}(-STOP|nn, L, non_adj) +0
 147 </pre>
 148 <ul>
 149 <li><span class="todo">TODO</span> tweak<br/>
 150 <a name="pstoptweak">&nbsp;</a>
 151 <pre>
 152             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 153             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 154             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 155             f[head, '-STOP', 'LA'] += (not i_h == 0)
 156             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 157             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 158             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 159             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 160 </pre>
 161 vs
 162 <pre>
 163             # this one requires some additional rewriting since it
 164             # introduces divisions by zero
 165             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 166             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 167             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 168             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 169             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 170             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 171             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 172             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 173 </pre>
 174 vs
 175 <pre>
 176             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 177             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 178             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 179             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 180             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 181             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 182             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 183             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 184 </pre>
 185 vs
 186 "all words take the same number of arguments" interpreted as
 187 <pre>
 188 for all heads:
 189     p_STOP(head, 'STOP', 'LN') = 0.3
 190     p_STOP(head, 'STOP', 'LA') = 0.5
 191     p_STOP(head, 'STOP', 'RN') = 0.4
 192     p_STOP(head, 'STOP', 'RA') = 0.7
 193 </pre>
 194 (which we easily may tweak in init_zeros())
 195 </li>
 196 </ul>
 197 </li>
 198 <li>P_CHOOSE<br/>
 199 Go through the corpus, counting distances between heads and
 200 arguments. In [det nn vb nn], we give
 201 <ul>
 202 <li>
 203 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 204 </li>
 205 <li>
 206 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 207 </li>
 208 <li>
 209 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 210 <ul>
 211 <li>
 212 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 213 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 214
 215 </li>
 216 </ul></li>
 217 </ul>
 218 <p>The ROOT gets "each argument with equal probability", so in a sentence
 219 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 220 just a frequency count of the corpus&hellip;
 221 </p></li>
 222 </ul>
 223 </li>
 224 </ul>
 225 </li>
 226 <li>[#C] Deferred<br/>
 227 <ul>
 228 <li><span class="todo">TODO</span> inner_dmv() should disregard rules with heads not in sent    &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 229 If the sentence is "nn vbd det nn", we should not even look at rules
 230 where
 231 <pre>
 232  rule.head() not in "nn vbd det nn".split()
 233 </pre>
 234 Also, we optimize this further by saying we don't even look at rules
 235 where
 236 <pre>
 237  rule.head() not in sent[ s :r+1]
 238  rule.head() not in sent[r+1:t+1]
 239 </pre>
 240 meaning, if we're looking at the span "vbd det", we only use
 241 attachment rules where both daughters are members of ['vbd','det']
 242 (although we don't (yet) care about removing rules that rewrite to the
 243 same tag if there are no duplicate tags in the span, etc., that would
 244 be a lot of trouble for little potential gain).
 245
 246 <p>
 247 Since STOP rules necessarily come from attachment rules, this ensures
 248 we don't end up with pointless STOP rules either, and if we <i>started</i>
 249 with a pointless, we nevertheless stop at the first rewrite-round
 250 (although we could also make that check in the first call to e()).
 251 </p></li>
 252 <li><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon  &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 253 </li>
 254 <li><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment            &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 255 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 256 with both Lattach=True and Rattach=True.
 257
 258 <p>
 259 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 260 Lattach=False, Rattach=False.
 261 </p>
 262 <p>
 263 Put this in the loop under rewrite rules (could also do it in the STOP
 264 section, but that would only have an effect on very short sentences).
 265 </p></li>
 266 <li><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice            &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 267 Eg, first sort and count, so that the corpus
 268 [['nn','vbd','det','nn'],
 269 ['vbd','nn','det','nn'],
 270 ['nn','vbd','det','nn']]
 271 becomes
 272 [(['nn','vbd','det','nn'],2),
 273 (['vbd','nn','det','nn'],1)]
 274 and then in each loop through sentences, make sure we handle the
 275 frequency correctly.
 276
 277 <p>
 278 Is there much to gain here?
 279 </p>
 280 </li>
 281 <li><span class="todo">TOGROK</span> tags as numbers or tags as strings?                        &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 282 Need to clean up the representation.
 283
 284 <p>
 285 Stick with tag-strings in initialization then switch to numbers for
 286 IO-algorithm perhaps? Can probably afford more string-matching in
 287 initialization..
 288 </p></li>
 289 </ul>
 290 </li>
 291 <li>Expectation Maximation in IO/DMV-terms<br/>
 292 inner(s,t,LHS) calculates the expected number of trees headed by LHS
 293 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 294 values, which have been conveniently distributed into CNF rules as
 295 probN and probA (non-adjacent and adjacent probabilites).
 296
 297 <p>
 298 When re-estimating, we use the expected values from inner() to get new
 299 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 300 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 301 that in the next round we use new probN and probA to find
 302 inner-probabilites.
 303 </p>
 304 <p>
 305 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 306 init_normalize() (here along with the creation of P_STOP and
 307 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 308 rule is STOP, P_CHOOSE is used to create rules of the form
 309 <pre>
 310  h  -&gt; h  _a_
 311  h_ -&gt; h_ _a_
 312 </pre>
 313 </p>
 314 <p>
 315 Since "adjacency" is not captured in regular CNF rules, we need two
 316 probabilites for each rule, and inner() has to know when to use which.
 317 </p>
 318 <ul>
 319 <li><span class="todo">TODO</span> Corpus access<br/>
 320 </li>
 321 <li><span class="todo">TOGROK</span> sentences or rules as the "outer loop"?                    &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 322 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
 323
 324
 325 </li>
 326 </ul>
 327 </li>
 328 </ul>
 329 </div>
 330
 331 <div class="outline-2">
 332 <h2 id="sec-2">2 Python-stuff</h2>
 333
 334 <ul>
 335 <li>
 336 <a href="src/pseudo.py">pseudo.py</a>
 337 </li>
 338 <li>
 339 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 340 </li>
 341 <li>
 342 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 343 </li>
 344 <li>
 345 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 346
 347
 348 </li>
 349 </ul>
 350 </div>
 351 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 352 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 353 </p>
 354 <p class="date"> Date: 2008/06/03 16:12:36</p>
 355 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 356 </html>