DMVCCM.html~20080528~

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/05/27 20:07:11"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM</h1>
  14 <div id="table-of-contents">
  15 <h2>Table of Contents</h2>
  16 <div id="text-table-of-contents">
  17 <ul>
  18 <li><a href="#sec-1">1 dmvccm</a>
  19 <ul>
  20 <li><a href="#sec-1.1">1.1 [#A] P<sub>STOP</sub></a>
  21 <ul>
  22 <li><a href="#sec-1.1.1">1.1.1 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also)</a></li>
  23 <li><a href="#sec-1.1.2">1.1.2 How is the P<sub>STOP</sub> formula different given other values for dir and adj?</a></li>
  24 </ul>
  25 </li>
  26 <li><a href="#sec-1.2">1.2 P<sub>CHOOSE</sub></a></li>
  27 <li><a href="#sec-1.3">1.3 Some (tagged) sentences are bound to come twice</a></li>
  28 <li><a href="#sec-1.4">1.4 Initialization   </a>
  29 <ul>
  30 <li><a href="#sec-1.4.1">1.4.1 DMV Initialization frequencies    </a></li>
  31 <li><a href="#sec-1.4.2">1.4.2 DMV Initialization probabilities from initialization frequency    </a></li>
  32 <li><a href="#sec-1.4.3">1.4.3 CCM Initialization    </a></li>
  33 </ul>
  34 </li>
  35 <li><a href="#sec-1.5">1.5 Adjacency and combining it with inner()</a></li>
  36 <li><a href="#sec-1.6">1.6 Is the E-step in DMV just inner() on the full sentence? </a></li>
  37 <li><a href="#sec-1.7">1.7 [#C] Corpus access</a></li>
  38 <li><a href="#sec-1.8">1.8 How do we interpret DMV as an inside/outside process?   </a></li>
  39 <li><a href="#sec-1.9">1.9 Technical: sentences or rules as the "outer loop"?</a></li>
  40 <li><a href="#sec-1.10">1.10 CCM-stuff</a></li>
  41 <li><a href="#sec-1.11">1.11 How do we know whether we are 'adjacent' or not? </a></li>
  42 </ul>
  43 </li>
  44 <li><a href="#sec-2">2 Python-stuff</a></li>
  45 </ul>
  46 </div>
  47 </div>
  48
  49 <div id="outline-container-1" class="outline-2">
  50 <h2 id="sec-1">1 dmvccm</h2>
  51 <div id="text-1">
  52
  53 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  54 (But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
  55 <a href="src/dmv.py">dmv.py</a>
  56 <a href="src/io.py">io.py</a>
  57 </p>
  58 </div>
  59
  60 <div id="outline-container-1.1" class="outline-3">
  61 <h3 id="sec-1.1">1.1 <span class="todo">TODO</span> [#A] P<sub>STOP</sub></h3>
  62 <div id="text-1.1">
  63
  64 <p><a href="src/dmv.py">dmv-P_STOP</a>
  65 </p>
  66 </div>
  67
  68 <div id="outline-container-1.1.1" class="outline-4">
  69 <h4 id="sec-1.1.1">1.1.1 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also)</h4>
  70 <div id="text-1.1.1">
  71
  72 </div>
  73
  74 </div>
  75
  76 <div id="outline-container-1.1.2" class="outline-4">
  77 <h4 id="sec-1.1.2">1.1.2 How is the P<sub>STOP</sub> formula different given other values for dir and adj?</h4>
  78 <div id="text-1.1.2">
  79
  80 <p>(Presumably, the P<sub>STOP</sub> formula where STOP is True is just the
  81 rule-probability of <u>h</u> -&gt; STOP h_ or h_ -&gt; h STOP, but how does
  82 adjacency fit in here?)
  83 </p>
  84 <p>
  85 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
  86 </p></div>
  87 </div>
  88
  89 </div>
  90
  91 <div id="outline-container-1.2" class="outline-3">
  92 <h3 id="sec-1.2">1.2 <span class="todo">TODO</span> P<sub>CHOOSE</sub></h3>
  93 <div id="text-1.2">
  94
  95 <p>Write the formulas! should be easy?
  96 </p></div>
  97
  98 </div>
  99
 100 <div id="outline-container-1.3" class="outline-3">
 101 <h3 id="sec-1.3">1.3 <span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 102 <div id="text-1.3">
 103
 104 <p>Eg, first sort and count, so that the corpus
 105 [['nn','vbd','det','nn'],
 106 ['vbd','nn','det','nn'],
 107 ['nn','vbd','det','nn']]
 108 becomes
 109 [(['nn','vbd','det','nn'],2),
 110 (['vbd','nn','det','nn'],1)]
 111 and then in each loop through sentences, make sure we handle the
 112 frequency correctly.
 113 </p>
 114 <p>
 115 Is there much to gain here?
 116 </p></div>
 117
 118 </div>
 119
 120 <div id="outline-container-1.4" class="outline-3">
 121 <h3 id="sec-1.4">1.4 Initialization   </h3>
 122 <div id="text-1.4">
 123
 124 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 125 we Do have to go through the corpus, since the probabilities are based
 126 on how far away in the sentence arguments are from their heads.
 127 </p>
 128 </div>
 129
 130 <div id="outline-container-1.4.1" class="outline-4">
 131 <h4 id="sec-1.4.1">1.4.1 <span class="done">DONE</span> DMV Initialization frequencies    </h4>
 132 <div id="text-1.4.1">
 133
 134 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 135 </p>
 136 </div>
 137
 138 <div id="outline-container-1.4.1.1" class="outline-5">
 139 <h5 id="sec-1.4.1.1">1.4.1.1 P<sub>STOP</sub>    </h5>
 140 <div id="text-1.4.1.1">
 141
 142 <p>P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 143 the sentence [det nn vb nn] is
 144 </p><ul>
 145 <li>
 146 f<sub>STOP</sub>( STOP|det, L, adj) +1
 147 </li>
 148 <li>
 149 f<sub>STOP</sub>(-STOP|det, L, adj) +0
 150 </li>
 151 <li>
 152 f<sub>STOP</sub>( STOP|det, L, non<sub>adj</sub>) +1
 153 </li>
 154 <li>
 155 f<sub>STOP</sub>(-STOP|det, L, non<sub>adj</sub>) +0
 156 </li>
 157 <li>
 158 f<sub>STOP</sub>( STOP|det, R, adj) +0
 159 </li>
 160 <li>
 161 f<sub>STOP</sub>(-STOP|det, R, adj) +1
 162
 163 </li>
 164 <li>
 165 f<sub>STOP</sub>( STOP|nn, L, adj) +0
 166 </li>
 167 <li>
 168 f<sub>STOP</sub>(-STOP|nn, L, adj) +1
 169 </li>
 170 <li>
 171 f<sub>STOP</sub>( STOP|nn, L, non<sub>adj</sub>) +1  # since there's at least one to the left
 172 </li>
 173 <li>
 174 f<sub>STOP</sub>(-STOP|nn, L, non<sub>adj</sub>) +0
 175
 176 </li>
 177 </ul></div>
 178
 179 </div>
 180
 181 <div id="outline-container-1.4.1.2" class="outline-5">
 182 <h5 id="sec-1.4.1.2">1.4.1.2 P<sub>CHOOSE</sub></h5>
 183 <div id="text-1.4.1.2">
 184
 185 <p>Go through the corpus, counting distances between heads and
 186 arguments. In [det nn vb nn], we give
 187 </p><ul>
 188 <li>
 189 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 190 </li>
 191 <li>
 192 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 193 </li>
 194 <li>
 195 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 196 <ul>
 197 <li>
 198 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 199 (1+1/3+2C) / sum<sub>a</sub> f<sub>CHOOSE</sub>(a|det, R)
 200
 201 </li>
 202 </ul></li>
 203 </ul>
 204 <p>The ROOT gets "each argument with equal probability", so in a sentence
 205 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 206 just a frequency count of the corpus&hellip;
 207 </p></div>
 208 </div>
 209
 210 </div>
 211
 212 <div id="outline-container-1.4.2" class="outline-4">
 213 <h4 id="sec-1.4.2">1.4.2 <span class="todo">TOGROK</span> DMV Initialization probabilities from initialization frequency    </h4>
 214 <div id="text-1.4.2">
 215
 216 </div>
 217
 218 </div>
 219
 220 <div id="outline-container-1.4.3" class="outline-4">
 221 <h4 id="sec-1.4.3">1.4.3 <span class="todo">TOGROK</span> CCM Initialization    </h4>
 222 <div id="text-1.4.3">
 223
 224 <p>P<sub>SPLIT</sub> used here&hellip; how, again?
 225 </p>
 226
 227
 228 </div>
 229 </div>
 230
 231 </div>
 232
 233 <div id="outline-container-1.5" class="outline-3">
 234 <h3 id="sec-1.5">1.5 <span class="todo">TOGROK</span> Adjacency and combining it with inner()</h3>
 235 <div id="text-1.5">
 236
 237 </div>
 238
 239 </div>
 240
 241 <div id="outline-container-1.6" class="outline-3">
 242 <h3 id="sec-1.6">1.6 <span class="todo">TOGROK</span> Is the E-step in DMV just inner() on the full sentence? </h3>
 243 <div id="text-1.6">
 244
 245 <p>and What exactly is the M-step of DMV?
 246 </p>
 247 </div>
 248
 249 </div>
 250
 251 <div id="outline-container-1.7" class="outline-3">
 252 <h3 id="sec-1.7">1.7 <span class="todo">TODO</span> [#C] Corpus access</h3>
 253 <div id="text-1.7">
 254
 255 </div>
 256
 257 </div>
 258
 259 <div id="outline-container-1.8" class="outline-3">
 260 <h3 id="sec-1.8">1.8 <span class="todo">TOGROK</span> How do we interpret DMV as an inside/outside process?   </h3>
 261 <div id="text-1.8">
 262
 263 <p>c<sub>s</sub>(x : i, j) is "the expected fraction of parses of s" with x from
 264 i to j; expectation then uses the probabilities gotten from
 265 initialization and previously gained probabilities, but these are of
 266 the form P<sub>STOP</sub> and P<sub>CHOOSE</sub>, how do we translate this to inside
 267 outside, which just uses the probabilities of CFG-rules?
 268 </p>
 269 </div>
 270
 271 </div>
 272
 273 <div id="outline-container-1.9" class="outline-3">
 274 <h3 id="sec-1.9">1.9 Technical: sentences or rules as the "outer loop"?</h3>
 275 <div id="text-1.9">
 276
 277 </div>
 278
 279 </div>
 280
 281 <div id="outline-container-1.10" class="outline-3">
 282 <h3 id="sec-1.10">1.10 CCM-stuff</h3>
 283 <div id="text-1.10">
 284
 285 </div>
 286
 287 </div>
 288
 289 <div id="outline-container-1.11" class="outline-3">
 290 <h3 id="sec-1.11">1.11 <span class="done">DONE</span> How do we know whether we are 'adjacent' or not? </h3>
 291 <div id="text-1.11">
 292
 293
 294 </div>
 295
 296 <div id="outline-container-1.11.0.1" class="outline-5">
 297 <h5 id="sec-1.11.0.1">1.11.0.1 One configuration that I'm fairly certain of: right w/CHOOSE</h5>
 298 <div id="text-1.11.0.1">
 299
 300 <p>if we have
 301 \Tree [<sub>b</sub> [<sub>b</sub> b <u>c</u> ] <u>d</u> ]
 302 then the lower tree [<sub>b</sub> b <u>c</u> ] is adjacent since, working your way up
 303 the tree, no argument has been created to the right "yet"; while the
 304 outer tree [<sub>b</sub> [<sub>b</sub> &hellip; ] <u>d</u> ] is non-adjacent, since there is something in
 305 between&hellip; Is it thus always adjacent to the right if the distance
 306 is 2? (That is, in e(s,t,i) for the adjacent rule: t - s == 2; while
 307 in the non_adj rule: t - s == 4)
 308 </p><ul>
 309 <li id="sec-1.11.0.1.1">Implementing this:<br/>
 310 Two different DMVRules? Or just two different prob-values per rule?
 311 </li>
 312 </ul>
 313 </div>
 314
 315 </div>
 316
 317 <div id="outline-container-1.11.0.2" class="outline-5">
 318 <h5 id="sec-1.11.0.2">1.11.0.2 left w/CHOOSE</h5>
 319 <div id="text-1.11.0.2">
 320
 321 <p>Same deal here?
 322 </p></div>
 323
 324 </div>
 325
 326 <div id="outline-container-1.11.0.3" class="outline-5">
 327 <h5 id="sec-1.11.0.3">1.11.0.3 R/L without CHOOSE, the "sealing operations"</h5>
 328 <div id="text-1.11.0.3">
 329
 330 <p><u>h</u> -&gt; STOP h_ and h_ -&gt; h STOP
 331 </p>
 332 <p>
 333 What is "adjacency" here? That t - s == 1?
 334 </p>
 335
 336
 337
 338 </div>
 339 </div>
 340 </div>
 341
 342 </div>
 343
 344 <div id="outline-container-2" class="outline-2">
 345 <h2 id="sec-2">2 Python-stuff</h2>
 346 <div id="text-2">
 347
 348 <ul>
 349 <li>
 350 <a href="src/pseudo.py">pseudo.py</a>
 351 </li>
 352 <li>
 353 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 354 </li>
 355 <li>
 356 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 357
 358
 359 </li>
 360 </ul>
 361 </div>
 362 </div>
 363 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 364 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 365 </p>
 366 <p class="date"> Date: 2008/05/27 20:07:11</p>
 367 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 368 </html>