DMVCCM.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM &ndash; todo-list / progress</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008-09-21 19:00:58 CEST"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <style type="text/css">
  12   html { font-family: Times, serif; font-size: 12pt; }
  13   .title  { text-align: center; }
  14   .todo   { color: red; }
  15   .done   { color: green; }
  16   .tag    { background-color:lightblue; font-weight:normal }
  17   .target { }
  18   .timestamp { color: grey }
  19   .timestamp-kwd { color: CadetBlue }
  20   p.verse { margin-left: 3% }
  21   pre {
  22         border: 1pt solid #AEBDCC;
  23         background-color: #F3F5F7;
  24         padding: 5pt;
  25         font-family: courier, monospace;
  26         font-size: 90%;
  27         overflow:auto;
  28   }
  29   table { border-collapse: collapse; }
  30   td, th { vertical-align: top; }
  31   dt { font-weight: bold; }
  32 </style><link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css"/>
  33 <!-- override with local style.css: -->
  34 <link rel="stylesheet" type="text/css" href="./style.css"/>
  35 </head><body>
  36 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
  37 <div id="table-of-contents">
  38 <h2>Table of Contents</h2>
  39 <div id="text-table-of-contents">
  40 <ul>
  41 <li><a href="#sec-1">1 DMV/CCM report and project</a></li>
  42 <li><a href="#sec-2">2 Notation</a></li>
  43 <li><a href="#sec-3">3 Testing the dependency parsed WSJ</a>
  44 <ul>
  45 <li><a href="#sec-3.1">3.1 [#A] Should <code>def evaluate</code> use add_root?</a></li>
  46 </ul>
  47 </li>
  48 <li><a href="#sec-4">4 Combine CCM with DMV</a></li>
  49 <li><a href="#sec-5">5 Reestimate P_ORDER ?</a></li>
  50 <li><a href="#sec-6">6 Most Probable Parse</a>
  51 <ul>
  52 <li><a href="#sec-6.1">6.1 Find MPP with CCM</a></li>
  53 <li><a href="#sec-6.2">6.2 Find Most Probable Parse of given test sentence, in DMV</a></li>
  54 </ul>
  55 </li>
  56 <li><a href="#sec-7">7 Initialization   </a>
  57 <ul>
  58 <li><a href="#sec-7.1">7.1 CCM Initialization    </a></li>
  59 </ul>
  60 </li>
  61 <li><a href="#sec-8">8 [#C] Alternative CNF for DMV</a>
  62 <ul>
  63 <li><a href="#sec-8.1">8.1 [#A] Make and implement an equivalent grammar that's <i>pure</i> CNF</a></li>
  64 <li><a href="#sec-8.2">8.2 [#A] convert L&amp;Y-based reestimation into P_ATTACH and P_STOP values</a></li>
  65 <li><a href="#sec-8.3">8.3 [#C] move as much as possible into common_dmv.py</a></li>
  66 <li><a href="#sec-8.4">8.4 L&amp;Y-based reestimation for cnf_dmv</a></li>
  67 <li><a href="#sec-8.5">8.5 dmv2cnf re-estimation formulas</a></li>
  68 <li><a href="#sec-8.6">8.6 inner and outer for cnf_dmv.py, also cnf_harmonic.py </a></li>
  69 </ul>
  70 </li>
  71 <li><a href="#sec-9">9 [#C] Deferred</a>
  72 <ul>
  73 <li><a href="#sec-9.1">9.1 Clean up reestimation code</a></li>
  74 <li><a href="#sec-9.2">9.2 [#A] compare speed of w_left/right(&hellip;) and w(LEFT/RIGHT, &hellip;)</a></li>
  75 <li><a href="#sec-9.3">9.3 when reestimating P_STOP etc, remove rules with p &lt; epsilon</a></li>
  76 <li><a href="#sec-9.4">9.4 inner_dmv, short ranges and impossible attachment</a></li>
  77 <li><a href="#sec-9.5">9.5 clean up the module files</a></li>
  78 <li><a href="#sec-9.6">9.6 Some (tagged) sentences are bound to come twice</a></li>
  79 <li><a href="#sec-9.7">9.7 tags as numbers or tags as strings?</a></li>
  80 </ul>
  81 </li>
  82 <li><a href="#sec-10">10 Adjacency and combining it with the inside-outside algorithm</a>
  83 <ul>
  84 <li><a href="#sec-10.1">10.1 Possible alternate type of adjacency</a></li>
  85 </ul>
  86 </li>
  87 <li><a href="#sec-11">11 Python-stuff</a></li>
  88 <li><a href="#sec-12">12 Git</a></li>
  89 </ul>
  90 </div>
  91 </div>
  92
  93 <div id="outline-container-1" class="outline-2">
  94 <h2 id="sec-1">1 DMV/CCM report and project</h2>
  95 <div id="text-1">
  96
  97 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-09-21 Sun</span><br/>
  98 </p><ul>
  99 <li>
 100 <a href="http://www.student.uib.no/~kun041/dmvccm/report.pdf">report.pdf</a> &ndash; Draft report for the whole project, including formulas
 101 for the full algorithms
 102
 103 </li>
 104 <li>
 105 <a href="src/main.py">main.py</a> &ndash; evaluation, corpus likelihoods
 106 </li>
 107 <li>
 108 <a href="src/wsjdep.py">wsjdep.py</a> &ndash; corpus reader for the dependency parsed WSJ
 109
 110 </li>
 111 <li>
 112 <a href="src/loc_h_dmv.py">loc_h_dmv.py</a> &ndash; DMV-IO and reestimation
 113 </li>
 114 <li>
 115 <a href="src/loc_h_harmonic.py">loc_h_harmonic.py</a> &ndash; DMV initialization
 116
 117 </li>
 118 <li>
 119 <a href="src/common_dmv.py">common_dmv.py</a> &ndash; various functions used by loc_h_dmv and others
 120 </li>
 121 <li>
 122 <a href="src/io.py">io.py</a> &ndash; non-DMV IO
 123
 124 </li>
 125 </ul>
 126
 127 <p>Deprecated:
 128 </p><ul>
 129 <li>
 130 <a href="src/cnf_dmv.py">cnf_dmv.py</a> &ndash; cnf-like implementation of DMV
 131 </li>
 132 <li>
 133 <a href="src/cnf_harmonic.py">cnf_harmonic.py</a> &ndash; initialization for cnf_dmv
 134
 135 </li>
 136 </ul>
 137
 138 <p><a href="http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html">Archived entries</a> from this file.
 139 </p></div>
 140
 141 </div>
 142
 143 <div id="outline-container-2" class="outline-2">
 144 <h2 id="sec-2">2 Notation</h2>
 145 <div id="text-2">
 146
 147 <p><pre class="example">
 148  old notes:   new notes:   in tex/code (constants):    in Klein thesis:
 149 --------------------------------------------------------------------------------------
 150  _h_            _h_            SEAL                    bar over h
 151   h_             h&gt;&lt;           RGOL                    right-under-left-arrow over h
 152   h              h&gt;            GOR                     right-arrow over h
 153
 154                &gt;&lt;h             LGOR                    left-under-right-arrow over h
 155                 &lt;h             GOL                     left-arrow over h
 156 </pre>
 157 These are represented in the code as pairs <code>(s_h,h)</code>, where <code>h</code> is an
 158 integer (POS-tag) and <code>s_h</code> &isin; <code>{SEAL,RGOL,GOR,LGOR,GOL}</code>.
 159 </p>
 160 <p>
 161 <code>P_ATTACH</code> and <code>P_CHOOSE</code> are synonymous; I try to use the
 162 former. Also,
 163 <pre class="example">
 164  P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir)*(1-P_STOP(STOP|h,dir,adj))
 165 </pre>
 166 </p>
 167 <p>
 168 (precalculated after each reestimation with <code>g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)</code>)
 169 </p>
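<p>
A minimal sketch of that precalculation, written in Python 3 rather than the
project's own Python 2. The dictionary layouts (<code>p_STOP[(h,dir,adj)]</code> and
<code>p_ATTACH[(a,h,dir)]</code>) and the name <code>make_go_at_sketch</code> are assumptions made
for illustration; the real <code>make_GO_AT</code> may be organized differently.
</p>
<pre class="example">
```python
LEFT, RIGHT = "L", "R"  # hypothetical direction constants


def make_go_at_sketch(p_STOP, p_ATTACH):
    """Precompute P_GO_AT = P_ATTACH(a|h,dir) * (1 - P_STOP(STOP|h,dir,adj)).

    Assumed layout (not taken from the real code):
      p_STOP[(h, dir, adj)]   probability of stopping
      p_ATTACH[(a, h, dir)]   probability of choosing argument a
    """
    p_GO_AT = {}
    for (a, h, direction), p_attach in p_ATTACH.items():
        for adj in (True, False):
            p_stop = p_STOP[(h, direction, adj)]
            p_GO_AT[(a, h, direction, adj)] = p_attach * (1.0 - p_stop)
    return p_GO_AT


# Toy numbers, chosen only to make the arithmetic easy to follow.
p_STOP = {("vbd", RIGHT, True): 0.25, ("vbd", RIGHT, False): 0.5}
p_ATTACH = {("nn", "vbd", RIGHT): 0.5}
p_GO_AT = make_go_at_sketch(p_STOP, p_ATTACH)
```
</pre>
<p>
Precomputing the product once per reestimation keeps the multiplication out of
the inner loops of inner() and outer(), at the cost of one extra table.
</p>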
 170 </div>
 171
 172 </div>
 173
 174 <div id="outline-container-3" class="outline-2">
 175 <h2 id="sec-3">3 Testing the dependency parsed WSJ</h2>
 176 <div id="text-3">
 177
 178 <p><a href="src/wsjdep.py">wsjdep.py</a> uses NLTK (sort of) to get a dependency parsed version of
 179 WSJ10 into the format used in mpp() in loc_h_dmv.py.
 180 </p>
 181 <p>
 182 As a default, <code>WSJDepCorpusReader</code> looks for the file <code>wsj.combined.10.dep</code> in
 183 <code>../corpus/wsjdep</code>.
 184 </p>
 185 <p>
 186 Only <code>sents()</code>, <code>tagged_sents()</code> and <code>parsed_sents()</code> (plus a new function
 187 <code>tagonly_sents()</code>) are implemented; the other NLTK corpus functions are
 188 ..um.. undefined&hellip;
 189 </p>
 190 </div>
 191
 192 <div id="outline-container-3.1" class="outline-3">
 193 <h3 id="sec-3.1">3.1 <span class="todo">TODO</span> [#A] Should <code>def evaluate</code> use add_root?</h3>
 194 <div id="text-3.1">
 195
 196 <p><a href="src/main.py">main.py</a> evaluate
 197 <a href="src/wsjdep.py">wsjdep.py</a> add_root
 198 </p>
 199 <p>
 200 (just has to count how many pairs are in there; Precision and Recall)
 201 </p></div>
 202 </div>
 203
 204 </div>
 205
 206 <div id="outline-container-4" class="outline-2">
 207 <h2 id="sec-4">4 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
 208 <div id="text-4">
 209
 210
 211 <p>
 212 <a name="comboquestions">&nbsp;</a>
 213 </p>
 214 <p>
 215 Questions about the <code>P_COMBO</code> info in <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">Klein's thesis</a>:
 216 </p><ul>
 217 <li>
 218 Page 109 (pdf: 125): We have to premultiply "all our probabilities"
 219 by the CCM base product <i>&Pi;<sub>&lt;i,j&gt;</sub>   P<sub>SPAN</sub>(&alpha;(i,j,s)|false)P<sub>CONTEXT</sub>(&beta;(i,j,s)|false)</i>; which
 220 probabilities are included under "all"? I'm assuming this includes
 221 <code>P_ATTACH</code> since each time <code>P_ATTACH</code> is used, <i>&phi;</i> is multiplied in
 222 (pp.110-111 ibid.); but <i>&phi;</i> is not used for STOPs, so should we not
 223 have our CCM product multiplied in there? How about <code>P_ROOT</code>?
 224 (Guessing <code>P_ORDER</code> is way out of the question&hellip;)
 225 </li>
 226 <li>
 227 For the outside probabilities, is it correct to assume we multiply
 228 in <i>&phi;(j,k)</i> or <i>&phi;(k,i)</i> when calculating <code>inner(i,j...)</code>? (I.e., only
 229 for the outside part, not for the whole range.) I don't understand
 230 the notation in <code>O()</code> on p.103.
 231 </li>
 232 </ul>
 233 </div>
 234
 235 </div>
 236
 237 <div id="outline-container-5" class="outline-2">
 238 <h2 id="sec-5">5 <span class="todo">TOGROK</span> Reestimate P_ORDER ?</h2>
 239 <div id="text-5">
 240
 241 </div>
 242
 243 </div>
 244
 245 <div id="outline-container-6" class="outline-2">
 246 <h2 id="sec-6">6 Most Probable Parse</h2>
 247 <div id="text-6">
 248
 249
 250 </div>
 251
 252 <div id="outline-container-6.1" class="outline-3">
 253 <h3 id="sec-6.1">6.1 <span class="todo">TOGROK</span> Find MPP with CCM</h3>
 254 <div id="text-6.1">
 255
 256 </div>
 257
 258 </div>
 259
 260 <div id="outline-container-6.2" class="outline-3">
 261 <h3 id="sec-6.2">6.2 <span class="done">DONE</span> Find Most Probable Parse of given test sentence, in DMV</h3>
 262 <div id="text-6.2">
 263
 264 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-07-23 Wed 10:56</span><br/>
 265 inner() optionally keeps track of the highest probability children of
 266 any node in <code>mpptree</code>. Say we're looking for <code>inner(i,j,(s_h,h),loc_h)</code> in
 267 a certain sentence, and we find some possible left and right children,
 268 we add to <code>mpptree[i,j,(s_h,h),loc_h]</code> the triple <code>(p, L, R)</code> where <code>L</code> and
 269 <code>R</code> are of the same form as the key (<code>i,j,(s_h,h),loc_h</code>) and <code>p</code> is the
 270 probability of this node rewriting to <code>L</code> and <code>R</code>,
 271 eg. <code>inner(L)*inner(R)*p_GO_AT</code> or <code>p_STOP</code> or whatever. We only add this
 272 entry to <code>mpptree</code> if there wasn't a higher-probability entry there
 273 before.
 274 </p>
 275 <p>
 276 Then, after <code>inner_sent</code> makes an <code>mpptree</code>, we find the <i>relevant</i>
 277 head-argument pairs by searching through the tree using a queue,
 278 adding the <code>L</code> and <code>R</code> keys of any entry to the queue as we find them
 279 (skipping <code>STOP</code> keys), and adding any attachment entries to a set of
 280 triples <code>(head,argument,dir)</code>. Thus we have our most probable parse,
 281 eg.
 282 <pre class="example">
 283  set([( ROOT, (vbd,2),RIGHT),
 284       ((vbd,2),(nn,1),LEFT),
 285       ((vbd,2),(nn,3),RIGHT),
 286       ((nn,1),(det,0),LEFT)])
 287 </pre>
 288 </p></div>
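<p>
The queue search described above could look roughly like the following Python 3
sketch. It assumes that <code>mpptree[key]</code> holds a single best <code>(p, L, R)</code> triple, that
a STOP child is the constant <code>STOP</code>, that terminals have no entry, and that the
head child is the one sharing <code>loc_h</code> with its parent. All of these are
assumptions for illustration, not a description of the code in loc_h_dmv.py, and
<code>ROOT</code> handling is left out.
</p>
<pre class="example">
```python
from collections import deque

STOP, LEFT, RIGHT = "STOP", "LEFT", "RIGHT"  # hypothetical constants


def attachments(mpptree, root_key):
    """Collect (head, argument, dir) triples by walking mpptree from root_key.

    Keys are (i, j, (s_h, h), loc_h); values are (p, L, R).
    """
    pairs = set()
    queue = deque([root_key])
    while queue:
        key = queue.popleft()
        if key == STOP or key not in mpptree:
            continue  # STOP sentinel or terminal: nothing below
        _p, left, right = mpptree[key]
        queue.extend([left, right])
        if STOP in (left, right):
            continue  # a STOP rewrite introduces no attachment
        loc_h = key[3]
        # The child that shares loc_h with the parent is the head.
        if left[3] == loc_h:
            head, arg, direction = left, right, RIGHT
        else:
            head, arg, direction = right, left, LEFT
        pairs.add(((head[2][1], head[3]), (arg[2][1], arg[3]), direction))
    return pairs


# Toy tree for the two-word sentence "nn vbd", with vbd (index 1) as head.
vbd_seal = (0, 2, ("SEAL", "vbd"), 1)
vbd_wide = (0, 2, ("RGOL", "vbd"), 1)
vbd_rgol = (1, 2, ("RGOL", "vbd"), 1)
vbd_gor = (1, 2, ("GOR", "vbd"), 1)
nn_seal = (0, 1, ("SEAL", "nn"), 0)
nn_rgol = (0, 1, ("RGOL", "nn"), 0)
nn_gor = (0, 1, ("GOR", "nn"), 0)
toy_tree = {
    vbd_seal: (1.0, STOP, vbd_wide),
    vbd_wide: (1.0, nn_seal, vbd_rgol),  # left attachment of nn
    vbd_rgol: (1.0, vbd_gor, STOP),
    nn_seal: (1.0, STOP, nn_rgol),
    nn_rgol: (1.0, nn_gor, STOP),
}
parse = attachments(toy_tree, vbd_seal)
```
</pre>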
 289 </div>
 290
 291 </div>
 292
 293 <div id="outline-container-7" class="outline-2">
 294 <h2 id="sec-7">7 Initialization   </h2>
 295 <div id="text-7">
 296
 297 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 298 </p>
 299 <p>
 300 We go through the corpus, since the probabilities are based on how far
 301 away in the sentence arguments are from their heads.
 302 </p>
 303 </div>
 304
 305 <div id="outline-container-7.1" class="outline-3">
 306 <h3 id="sec-7.1">7.1 <span class="todo">TOGROK</span> CCM Initialization    </h3>
 307 <div id="text-7.1">
 308
 309 <p>P<sub>SPLIT</sub> used here&hellip; how, again?
 310 </p></div>
 311 </div>
 312
 313 </div>
 314
 315 <div id="outline-container-8" class="outline-2">
 316 <h2 id="sec-8">8 <span class="todo">TODO</span> [#C] Alternative CNF for DMV</h2>
 317 <div id="text-8">
 318
 319
 320 <p>
 321 <a name="dmv2cnf">&nbsp;</a>
 322 </p><ul>
 323 <li>
 324 <a href="src/cnf_dmv.py">cnf_dmv.py</a>
 325 </li>
 326 <li>
 327 <a href="src/cnf_harmonic.py">cnf_harmonic.py</a>
 328
 329 </li>
 330 </ul>
 331
 332 <p>See section 5 of <a href="tex/formulas.pdf">formulas.pdf</a>.
 333 </p>
 334 <p>
 335 Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
 336 <pre class="example">
 337 &gt;&gt;&gt; print testgrammar_h()
 338   h&gt;&lt; --&gt;   h&gt;  STOP   [0.30]
 339   h&gt;&lt; --&gt;  &gt;h&gt;  STOP   [0.40]
 340  _h_  --&gt; STOP    h&gt;&lt;  [1.00]
 341  _h_  --&gt; STOP   &lt;h&gt;&lt;  [1.00]
 342  &gt;h&gt;  --&gt;   h&gt;   _h_   [1.00]
 343  &gt;h&gt;  --&gt;  &gt;h&gt;   _h_   [1.00]
 344  &lt;h&gt;&lt; --&gt;  _h_    h&gt;&lt;  [0.70]
 345  &lt;h&gt;&lt; --&gt;  _h_   &lt;h&gt;&lt;  [0.60]
 346 ROOT  --&gt; STOP   _h_   [1.00]
 347 </pre>
 348 </p>
 349
 350 </div>
 351
 352 <div id="outline-container-8.1" class="outline-3">
 353 <h3 id="sec-8.1">8.1 <span class="todo">TODO</span> [#A] Make and implement an equivalent grammar that's <i>pure</i> CNF</h3>
 354 <div id="text-8.1">
 355
 356 <p>&hellip;since I'm not sure about my unary reestimation rules (section 5 of
 357 <a href="tex/formulas.pdf">formulas</a>).
 358 </p>
 359 <p>
 360 For any rule where LHS is <code>_h_</code> we also have a corresponding one with
 361 LHS <code>ROOT</code>, only difference being that we multiply in <code>p_ROOT(h)</code>.
 362 </p>
 363 <p>
 364 For any rule where LHS is <code>.h&gt;</code>, we use adjacent probabilities for the
 365 left child; if LHS is <code>&lt;h.</code> we use adjacent probabilities for the right
 366 child. Only <code>_h_</code> and <code>_h&gt;_</code> (plus <code>ROOT</code>) get to introduce the pre-terminal
 367 <code>h</code> (where <code>h</code>, <code>ROOT</code> and <code>_h_</code> all rewrite to the terminal
 368 <code>'h'</code>), and only <code>_h_</code> and <code>_h&gt;_</code> (plus <code>ROOT</code>) act as STOP
 369 rules (i.e. get to multiply in <code>p(STOP)</code>).
 370 </p>
 371 <p>
 372 <pre class="example">
 373   h   --&gt;  'h'         1
 374  _h_  --&gt;  'h'         p(STOP|h,L,adj) * p(STOP|h,R,adj)
 375  ROOT --&gt;  'h'         p(STOP|h,L,adj) * p(STOP|h,R,adj) * p_ROOT(h)
 376
 377  _h_  --&gt;   h    _a_   p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
 378  _h_  --&gt;   h    .h&gt;   p(STOP|h,L,adj) * p(STOP|h,R,non)
 379  .h&gt;  --&gt;  _a_   _b_   p(a|h,R)*p(-STOP|h,R,adj) * p(b|h,R)*p(-STOP|h,R,non)
 380  .h&gt;  --&gt;  _a_    h&gt;   p(a|h,R)*p(-STOP|h,R,adj)
 381   h&gt;  --&gt;  _a_   _b_   p(a|h,R)*p(-STOP|h,R,non) * p(b|h,R)*p(-STOP|h,R,non)
 382   h&gt;  --&gt;  _a_    h&gt;   p(a|h,R)*p(-STOP|h,R,non)
 383
 384  _h_  --&gt;  _a_    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj)
 385  _h_  --&gt;  &lt;h.    h    p(STOP|h,L,non) * p(STOP|h,R,adj)
 386  &lt;h.  --&gt;  _b_   _a_   p(b|h,L)*p(-STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
 387  &lt;h.  --&gt;  &lt;h    _a_                               p(a|h,L)*p(-STOP|h,L,adj)
 388  &lt;h   --&gt;  _a_   _b_   p(a|h,L)*p(-STOP|h,L,non) * p(b|h,L)*p(-STOP|h,L,non)
 389  &lt;h   --&gt;  &lt;h    _a_   p(a|h,L)*p(-STOP|h,L,non)
 390
 391  _h_  --&gt;  &lt;h.    _h&gt;_ p(STOP|h,L,non)
 392  _h_  --&gt;  _a_    _h&gt;_ p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
 393  _h&gt;_ --&gt;   h     .h&gt;  p(STOP|h,R,non)
 394  _h&gt;_ --&gt;   h     _a_  p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
 395
 396  ROOT --&gt;   h    _a_   p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj) * p_ROOT(h)
 397  ROOT --&gt;   h    .h&gt;   p(STOP|h,L,adj) * p(STOP|h,R,non) * p_ROOT(h)
 398
 399  ROOT --&gt;  _a_    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
 400  ROOT --&gt;  &lt;h.    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p_ROOT(h)
 401
 402  ROOT --&gt;  &lt;h.   _h&gt;_  p(STOP|h,L,non) * p_ROOT(h)
 403  ROOT --&gt;  _a_   _h&gt;_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
 404
 405 </pre>
 406 </p>
 407 <p>
 408 Since we have rules rewriting <code>h</code> to <code>a</code> and <code>b</code>, the rule-set grows with
 409 n<sub>tags</sub><sup>3</sup> (one rule per triple <code>h</code>, <code>a</code>, <code>b</code>), well beyond n<sub>tags</sub><sup>2</sup>.
 410 </p>
 411 </div>
 412
 413 </div>
 414
 415 <div id="outline-container-8.2" class="outline-3">
 416 <h3 id="sec-8.2">8.2 <span class="todo">TOGROK</span> [#A] convert L&amp;Y-based reestimation into P_ATTACH and P_STOP values</h3>
 417 <div id="text-8.2">
 418
 419 <p>Sum over the various rules? Or something? Must think of this.
 420 </p></div>
 421
 422 </div>
 423
 424 <div id="outline-container-8.3" class="outline-3">
 425 <h3 id="sec-8.3">8.3 <span class="todo">TODO</span> [#C] move as much as possible into common_dmv.py</h3>
 426 <div id="text-8.3">
 427
 428 <p><a href="src/common_dmv.py">common_dmv.py</a>
 429 </p></div>
 430
 431 </div>
 432
 433 <div id="outline-container-8.4" class="outline-3">
 434 <h3 id="sec-8.4">8.4 <span class="done">DONE</span> L&amp;Y-based reestimation for cnf_dmv</h3>
 435 <div id="text-8.4">
 436
 437 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-08-21 Thu 16:35</span><br/>
 438 </p></div>
 439
 440 </div>
 441
 442 <div id="outline-container-8.5" class="outline-3">
 443 <h3 id="sec-8.5">8.5 <span class="done">DONE</span> dmv2cnf re-estimation formulas</h3>
 444 <div id="text-8.5">
 445
 446 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-08-21 Thu 16:36</span><br/>
 447 </p></div>
 448
 449 </div>
 450
 451 <div id="outline-container-8.6" class="outline-3">
 452 <h3 id="sec-8.6">8.6 <span class="done">DONE</span> inner and outer for cnf_dmv.py, also cnf_harmonic.py </h3>
 453 <div id="text-8.6">
 454
 455 </div>
 456 </div>
 457
 458 </div>
 459
 460 <div id="outline-container-9" class="outline-2">
 461 <h2 id="sec-9">9 [#C] Deferred</h2>
 462 <div id="text-9">
 463
 464 <p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> Eg., use
 465 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
 466 for-loops; use local variables for globals (global variables or
 467 functions), etc.
 468 </p>
 469 </div>
 470
 471 <div id="outline-container-9.1" class="outline-3">
 472 <h3 id="sec-9.1">9.1 <span class="todo">TODO</span> Clean up reestimation code                                    &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
 473 <div id="text-9.1">
 474
 475 </div>
 476
 477 </div>
 478
 479 <div id="outline-container-9.2" class="outline-3">
 480 <h3 id="sec-9.2">9.2 <span class="todo">TODO</span> [#A] compare speed of w_left/right(&hellip;) and w(LEFT/RIGHT, &hellip;) &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 481 <div id="text-9.2">
 482
 483 </div>
 484
 485 </div>
 486
 487 <div id="outline-container-9.3" class="outline-3">
 488 <h3 id="sec-9.3">9.3 <span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 489 <div id="text-9.3">
 490
 491 </div>
 492
 493 </div>
 494
 495 <div id="outline-container-9.4" class="outline-3">
 496 <h3 id="sec-9.4">9.4 <span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 497 <div id="text-9.4">
 498
 499 <p>If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 500 with both Lattach=True and Rattach=True.
 501 </p>
 502 <p>
 503 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 504 Lattach=False, Rattach=False.
 505 </p>
 506 <p>
 507 Put this in the loop under rewrite rules (could also do it in the STOP
 508 section, but that would only have an effect on very short sentences).
 509 </p></div>
 510
 511 </div>
 512
 513 <div id="outline-container-9.5" class="outline-3">
 514 <h3 id="sec-9.5">9.5 <span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
 515 <div id="text-9.5">
 516
 517 <p>Is there a better way to divide dmv and harmonic? There's a two-way
 518 dependency between the modules. Guess there could be a third file that
 519 imports both the initialization and the actual EM stuff, while a file
 520 containing constants and classes could be imported by all others:
 521 <pre class="example">
 522  dmv.py imports dmv_EM.py imports dmv_classes.py
 523  dmv.py imports dmv_inits.py imports dmv_classes.py
 524 </pre>
 525 </p>
 526 </div>
 527
 528 </div>
 529
 530 <div id="outline-container-9.6" class="outline-3">
 531 <h3 id="sec-9.6">9.6 <span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 532 <div id="text-9.6">
 533
 534 <p>Eg, first sort and count, so that the corpus
 535 [['nn','vbd','det','nn'],
 536 ['vbd','nn','det','nn'],
 537 ['nn','vbd','det','nn']]
 538 becomes
 539 [(['nn','vbd','det','nn'],2),
 540 (['vbd','nn','det','nn'],1)]
 541 and then in each loop through sentences, make sure we handle the
 542 frequency correctly.
 543 </p>
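<p>
A small Python 3 sketch of the sort-and-count idea, using <code>collections.Counter</code>.
The function names and the <code>sent_prob</code> callback are hypothetical; the point is
only that every per-sentence quantity is weighted by the sentence's frequency.
</p>
<pre class="example">
```python
import math
from collections import Counter


def count_sentences(corpus):
    """Collapse duplicate tag sequences into sorted (sentence, frequency) pairs."""
    counts = Counter(tuple(sent) for sent in corpus)
    return [(list(sent), n) for sent, n in sorted(counts.items())]


def corpus_logprob(counted, sent_prob):
    """Corpus log-likelihood, visiting each distinct sentence once.

    sent_prob is a hypothetical callable mapping a tag list to its probability.
    """
    return sum(n * math.log(sent_prob(sent)) for sent, n in counted)


corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]
counted = count_sentences(corpus)
```
</pre>
<p>
The same weighting would apply to the expected counts gathered during
reestimation: multiply each sentence's contribution by its frequency.
</p>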
 544 <p>
 545 Is there much to gain here?
 546 </p>
 547 </div>
 548
 549 </div>
 550
 551 <div id="outline-container-9.7" class="outline-3">
 552 <h3 id="sec-9.7">9.7 <span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 553 <div id="text-9.7">
 554
 555 <p>Need to clean up the representation.
 556 </p>
 557 <p>
 558 Stick with tag-strings in initialization then switch to numbers for
 559 IO-algorithm perhaps? Can probably afford more string-matching in
 560 initialization..
 561 </p></div>
 562 </div>
 563
 564 </div>
 565
 566 <div id="outline-container-10" class="outline-2">
 567 <h2 id="sec-10">10 Adjacency and combining it with the inside-outside algorithm</h2>
 568 <div id="text-10">
 569
 570 <p>Each DMV probability (for a certain PCFG node) has both an adjacent
 571 and a non-adjacent probability. inner() and outer() need the correct
 572 one in each case.
 573 </p>
 574 <p>
 575 In each inner() call, loc_h is the location of the head of this
 576 dependency structure. In each outer() call, it's the head of the <i>Node</i>,
 577 the structure we're looking outside of.
 578 </p>
 579 <p>
 580 We call inner() for each location of a head, and on each terminal,
 581 loc_h must equal <code>i</code> (and <code>loc_h+1</code> equal <code>j</code>). In the recursive attachment
 582 calls, we use the locations (sentence indices) of words to the left or
 583 right of the head in calls to inner(). <i>loc_h lets us check whether we need probN or probA</i>.
 584 </p>
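<p>
One way to read the last sentence, as a Python 3 sketch: with a terminal head
spanning <code>(loc_h, loc_h+1)</code>, an attachment that splits the range at <code>k</code> is adjacent
exactly when the head's own part of the range has not grown in that direction yet.
The helper names and the split-point convention are assumptions made for
illustration, not taken from the actual inner().
</p>
<pre class="example">
```python
def is_adjacent_right(k, loc_h):
    """Right attachment: head covers (i, k), argument covers (k, j).

    The head has no right arguments yet iff its part ends right after the head.
    """
    return k == loc_h + 1


def is_adjacent_left(k, loc_h):
    """Left attachment: argument covers (i, k), head covers (k, j).

    The head has no left arguments yet iff its part starts at the head.
    """
    return k == loc_h


def pick_prob(prob_adj, prob_non, adjacent):
    """Choose between the adjacent (probA) and non-adjacent (probN) value."""
    return prob_adj if adjacent else prob_non


# Head at index 2: a split at 3 is adjacent on the right, a split at 4 is not.
first_right = pick_prob(0.75, 0.25, is_adjacent_right(3, 2))
later_right = pick_prob(0.75, 0.25, is_adjacent_right(4, 2))
```
</pre>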
 585 </div>
 586
 587 <div id="outline-container-10.1" class="outline-3">
 588 <h3 id="sec-10.1">10.1 Possible alternate type of adjacency</h3>
 589 <div id="text-10.1">
 590
 591 <p>K&amp;M's adjacency is just whether or not an argument has been generated
 592 in the current direction yet. One could also make a stronger type of
 593 adjacency, where h and a are not adjacent if b is in between, eg. with
 594 the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
 595 K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
 596 to check this type of adjacency in inner(), but it needs new rules for
 597 P_STOP reestimation.
 598 </p></div>
 599 </div>
 600
 601 </div>
 602
 603 <div id="outline-container-11" class="outline-2">
 604 <h2 id="sec-11">11 Python-stuff</h2>
 605 <div id="text-11">
 606
 607 <p>Make those debug statements steal a bit less attention in emacs:
 608 <pre class="example">
 609 (font-lock-add-keywords
 610  'python-mode                   ; not really regexp, a bit slow
 611  '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
 612 \\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
 613 (font-lock-add-keywords
 614  'python-mode
 615  '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
 616 </pre>
 617 </p>
 618 <ul>
 619 <li>
 620 <a href="src/pseudo.py">pseudo.py</a>
 621 </li>
 622 <li>
 623 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 624 </li>
 625 <li>
 626 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 627 </li>
 628 <li>
 629 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 630
 631
 632
 633 </li>
 634 </ul>
 635 </div>
 636
 637 </div>
 638
 639 <div id="outline-container-12" class="outline-2">
 640 <h2 id="sec-12">12 Git</h2>
 641 <div id="text-12">
 642
 643 <p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
 644 </p>
 645 <p>
 646 Setting up a new project:
 647 <pre class="example">
 648  git init
 649  git add .
 650  git commit -m "first release"
 651 </pre>
 652 </p>
 653 <p>
 654 Later on: (<code>-a</code> stages changed and deleted tracked files automatically; new files still need <code>git add</code>)
 655 <pre class="example">
 657  git commit -a -m "some subsequent release"
 658 </pre>
 659 </p>
 660 <p>
 661 Then push stuff up to the remote server:
 662 <pre class="example">
 663  git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 664 </pre>
 665 </p>
 666 <p>
 667 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in the passphrase all
 668 the time)
 669 </p>
 670 <p>
 671 Make a copy of the (remote) master branch:
 672 <pre class="example">
 673  git clone git://repo.or.cz/dmvccm.git
 674 </pre>
 675 </p>
 676 <p>
 677 Make and name a new branch in this folder
 678 <pre class="example">
 679  git checkout -b mybranch
 680 </pre>
 681 </p>
 682 <p>
 683 To save changes in <code>mybranch</code>:
 684 <pre class="example">
 685  git commit -a
 686 </pre>
 687 </p>
 688 <p>
 689 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
 690 carried over):
 691 <pre class="example">
 692  git checkout master
 693 </pre>
 694 </p>
 695 <p>
 696 Try out:
 697 <pre class="example">
 698  git add --interactive
 699 </pre>
 700 </p>
 701 <p>
 702 Good tutorial:
 703 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
 704 </p></div>
 705 </div>
 706 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 707 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 708 </p>
 709 <p class="date"> Date: 2008-09-21 19:00:58 CEST</p>
 710 <p>HTML generated by <a href='http://orgmode.org/'>org-mode</a> 6.06b in emacs 22</p>
 711 </div><script src="./post-script.js" type="text/JavaScript">
 712 </script></body>
 713 </html>