DMVCCM.org_archive

   1 # -*- mode: org -*-
   2 # -*- coding: mule-utf-8-unix -*-
   3
   4 #+STARTUP: overview
   5 #+TAGS: OPTIMIZE PRETTIER
   6 #+STARTUP: hidestars
   7 #+TITLE: DMV/CCM -- todo-list / progress ARCHIVED ENTRIES
   8 #+AUTHOR: Kevin Brubeck Unhammer
   9 #+EMAIL: K.BrubeckUnhammer at student uva nl
  10 #+OPTIONS: ^:{}
  11 #+LANGUAGE: en
  12 #+SEQ_TODO: TOGROK TODO DONE
  13
  14
  15 Archived entries from file /Users/kiwibird/dmvccm/DMVCCM.org
  16 * DONE [#A] test and debug my brilliant idea
  17   CLOSED: [2008-06-08 Sun 10:28]
  18   :PROPERTIES:
  19   :ARCHIVE_TIME: 2008-06-08 Sun 12:55
  20   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  21   :ARCHIVE_OLPATH: Adjacency and combining it with inner()
  22   :ARCHIVE_CATEGORY: DMVCCM
  23   :ARCHIVE_TODO: DONE
  24   :END:
  25 * DONE implement my brilliant idea.
  26     CLOSED: [2008-06-01 Sun 17:19]
  27   :PROPERTIES:
  28     :ARCHIVE_TIME: 2008-06-08 Sun 12:55
  29     :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  30     :ARCHIVE_OLPATH: Adjacency and combining it with inner()
  31     :ARCHIVE_CATEGORY: DMVCCM
  32     :ARCHIVE_TODO: DONE
  33   :END:
  34 [[file:src/dmv.py::def%20e%20s%20t%20LHS%20Lattach%20Rattach][e(sti) in dmv.py]]
  35
  36 * DONE [#A] test inner() on sentences with duplicate words
  37   :PROPERTIES:
  38   :ARCHIVE_TIME: 2008-06-08 Sun 12:55
  39   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  40   :ARCHIVE_OLPATH: Adjacency and combining it with inner()
  41   :ARCHIVE_CATEGORY: DMVCCM
  42   :ARCHIVE_TODO: DONE
  43   :END:
   44 Works with e.g. the sentence "h h h".
  45 * DONE [#A] How do we only count from completed trees?
  46    CLOSED: [2008-06-13 Fri 11:40]
  47   :PROPERTIES:
  48    :ARCHIVE_TIME: 2008-06-15 Sun 23:52
  49    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  50    :ARCHIVE_OLPATH: P_STOP and P_CHOOSE for IO/EM (reestimation)
  51    :ARCHIVE_CATEGORY: DMVCCM
  52    :ARCHIVE_TODO: DONE
  53   :END:
  54 Use c(s,t,Node); inner * outer / P_sent
  55
  56 * DONE [#A] c(s,t,Node)
  57   CLOSED: [2008-06-13 Fri 11:38]
  58   :PROPERTIES:
  59   :ARCHIVE_TIME: 2008-06-15 Sun 23:52
  60   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  61   :ARCHIVE_OLPATH: P_STOP and P_CHOOSE for IO/EM (reestimation)
  62   :ARCHIVE_CATEGORY: DMVCCM
  63   :ARCHIVE_TODO: DONE
  64   :END:
  65 = inner * outer / P_sent
  66
  67 implemented as inner * outer / inner_sent
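
A minimal sketch of the count above, assuming the inner and outer values for the span have already been computed. The function name and the toy numbers are illustrative and not taken from dmv.py.

```python
def expected_count(inner_p, outer_p, inner_sent):
    """c(s,t,Node) = inner(s,t,Node) * outer(s,t,Node) / inner_sent.

    inner_sent is the inner probability of the whole sentence (P_sent).
    """
    if inner_sent == 0.0:
        # A sentence with zero probability contributes no counts.
        return 0.0
    return inner_p * outer_p / inner_sent

# Toy numbers, purely illustrative:
c_toy = expected_count(inner_p=0.02, outer_p=0.5, inner_sent=0.04)
```

Guarding against a zero sentence probability is a choice made for this sketch; the notes do not say how unparseable sentences are handled.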
  68 * DONE if loc_h == t, no need to try right-attachment rules &v.v.     :OPTIMIZE:
  69    CLOSED: [2008-06-10 Tue 14:34]
  70   :PROPERTIES:
  71    :ARCHIVE_TIME: 2008-06-15 Sun 23:52
  72    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  73    :ARCHIVE_OLPATH: Deferred
  74    :ARCHIVE_CATEGORY: DMVCCM
  75    :ARCHIVE_TODO: DONE
  76   :END:
  77 (and if loc_h == s, no need to try left-attachment rules.)
  78
  79 Modest speed increase (5%).
  80 * DONE io.debug parameters should not call functions                  :OPTIMIZE:
  81    CLOSED: [2008-06-10 Tue 12:26]
  82   :PROPERTIES:
  83    :ARCHIVE_TIME: 2008-06-15 Sun 23:52
  84    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  85    :ARCHIVE_OLPATH: Deferred
  86    :ARCHIVE_CATEGORY: DMVCCM
  87    :ARCHIVE_TODO: DONE
  88   :END:
  89 Exchanged all io.debug(str,'level') calls with statements of the form:
  90 :if 'level' in io.DEBUG:
  91 :    print str
  92
  93 and got an almost threefold speed increase on inner().
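
The speedup presumably comes from no longer building the debug string when the level is switched off. A small sketch of that effect; =describe=, =debug= and the counter are stand-ins invented for the illustration, not the io module itself.

```python
DEBUG = set()            # e.g. {'inner'} would enable inner() tracing
calls = {'n': 0}         # counts how often the expensive call runs

def describe(chart):
    """Stand-in for an expensive string-building call."""
    calls['n'] += 1
    return ', '.join('%s=%.3f' % kv for kv in sorted(chart.items()))

def debug(msg, level):
    """Old-style helper: the caller has already built msg."""
    if level in DEBUG:
        print(msg)

chart = {'a': 0.5, 'b': 0.25}

# Old style: the argument is evaluated even though nothing is printed.
debug(describe(chart), 'inner')
eager_calls = calls['n']

# New style: the guard skips building the string entirely.
if 'inner' in DEBUG:
    print(describe(chart))
guarded_calls = calls['n'] - eager_calls
```

With debugging off, the old style still pays for one =describe= call per statement, while the guarded form pays only for a set lookup.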
  94 * DONE inner_dmv() should disregard rules with heads not in sent      :OPTIMIZE:
  95    CLOSED: [2008-06-08 Sun 10:18]
  96   :PROPERTIES:
  97    :ARCHIVE_TIME: 2008-06-15 Sun 23:52
  98    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
  99    :ARCHIVE_OLPATH: Deferred
 100    :ARCHIVE_CATEGORY: DMVCCM
 101    :ARCHIVE_TODO: DONE
 102   :END:
 103 If the sentence is "nn vbd det nn", we should not even look at rules
 104 where
 105 : rule.head() not in "nn vbd det nn".split()
 106 This is ruled out by getting rules from g.rules(LHS, sent).
 107
 108 Also, we optimize this further by saying we don't even recurse into
 109 attachment rules where
 110 : rule.head() not in sent[ s :r+1]
 111 : rule.head() not in sent[r+1:t+1]
 112 meaning, if we're looking at the span "vbd det", we only use
 113 attachment rules where both daughters are members of ['vbd','det']
 114 (although we don't (yet) care about removing rules that rewrite to the
 115 same tag if there are no duplicate tags in the span, etc., that would
 116 be a lot of trouble for little potential gain).
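
A sketch of the two filters described above. The =Rule= record and both function names are hypothetical; the actual rule class and =g.rules(LHS, sent)= in dmv.py may look different.

```python
from collections import namedtuple

# Hypothetical rule record: head tags of the LHS and of both daughters.
Rule = namedtuple('Rule', 'lhs left right')

def rules_for_sentence(rules, sent):
    """Keep only rules whose LHS head occurs somewhere in the sentence."""
    tags = set(sent)
    return [rule for rule in rules if rule.lhs in tags]

def rules_for_split(rules, sent, s, r, t):
    """Keep attachment rules whose daughters can head sent[s:r+1] and sent[r+1:t+1]."""
    left_tags = set(sent[s:r + 1])
    right_tags = set(sent[r + 1:t + 1])
    return [rule for rule in rules
            if rule.left in left_tags and rule.right in right_tags]

sent = 'nn vbd det nn'.split()
rules = [Rule('vbd', 'vbd', 'det'), Rule('vbd', 'nn', 'vbd'),
         Rule('det', 'det', 'nn'), Rule('jj', 'jj', 'nn')]
in_sent = rules_for_sentence(rules, sent)
# Span "vbd det" (s=1, t=2), split after vbd (r=1):
in_span = rules_for_split(in_sent, sent, 1, 1, 2)
```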
 117 * DONE Problem with this formula:
 118   :PROPERTIES:
 119   :ARCHIVE_TIME: 2008-06-23 Mon 15:10
 120   :ARCHIVE_FILE: ~/V08/Probability/dmvccm/DMVCCM.org
 121   :ARCHIVE_OLPATH: P_STOP and P_CHOOSE for IO/EM (reestimation)/Implement P_CHOOSE formula.
 122   :ARCHIVE_CATEGORY: DMVCCM
 123   :ARCHIVE_TODO: TOGROK
 124   :END:
 125 On calculating P_{CHOOSE}(det | vbd, L) from the 1-sentence corpus
 126 "det nn vbd", there are two ways in which 'det' could be left
 127 attached; one is where it is attached non-adjacently (after 'vbd' has
 128 attached 'nn'):
 129 :>>> c_L = c(0,0,(SEAL,g.tagnum('det')),0,g,'det nn vbd'.split(),{},{})
 130 :>>> c_L
 131 :0.62669683257918563 # so far so good
 132 :
 133 :>>> c_R = c(1,2,(RGO_L,g.tagnum('vbd')),2,g,'det nn vbd'.split(),{},{})
 134 :>>> c_R
  135 :0.11312217194570134 # still seems an OK probability
 136 :
 137 :>>> c_M = c(0,2,(RGO_L,g.tagnum('vbd')),2,g,'det nn vbd'.split(),{},{})
 138 :>>> c_M
 139 :0.31674208144796384 # and this seems good
 140 :
 141 :>>> c_L / (c_R * c_M)
 142 :17.490571428571432  # but this is Way off...
 143
 144 * DONE L&Y formula (20) or c()-formula?
 145    CLOSED: [2008-07-23 Wed 10:52]
 146   :PROPERTIES:
 147    :ARCHIVE_TIME: 2008-07-23 Wed 10:52
 148    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 149    :ARCHIVE_OLPATH: P_STOP and P_CHOOSE for IO/EM (reestimation)
 150    :ARCHIVE_CATEGORY: DMVCCM
 151    :ARCHIVE_TODO: DONE
 152   :END:
 153 For P_CHOOSE, use this formula to get a[h_,_a_,h_]:
 154 | w_sent = 1/P_sent * \sum_{r} prob(h_->_a_ h_) * e(s,r,_a_) * e(r+1,t, h_) * f(s,t,h_) |
 155 | /                                                                                     |
 156 | v_sent = 1/P_sent * e(s,t,h_) * f(s,t,h_) = c(s,t,h_)                                 |
 157
 158 then divide a[h_,_a_,h_] by sum of a[h_,_x_,h_] for all x. \\
 159 Similarly, P_CHOOSE(a|h,R) = a[h,h,_a_] / \sum_{x} a[h,h,_x_].
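
The normalization step can be sketched as follows, assuming the summed w/v ratios are stored in a dict keyed by (head, argument, direction). That layout and the toy counts are assumptions made for this example.

```python
from collections import defaultdict

def normalize_choose(a):
    """Turn counts a[(head, arg, direction)] into P_CHOOSE(arg | head, direction)."""
    totals = defaultdict(float)
    for (head, arg, direction), count in a.items():
        totals[head, direction] += count
    return dict((key, count / totals[key[0], key[2]])
                for key, count in a.items()
                if totals[key[0], key[2]] > 0.0)

# Toy counts, purely illustrative:
a = {('vbd', 'nn', 'L'): 0.6, ('vbd', 'det', 'L'): 0.2, ('vbd', 'nn', 'R'): 0.5}
p_choose = normalize_choose(a)
```

After normalization the values for each (head, direction) pair sum to one, which is the property the raw ratios lack.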
 160
 161
 162 For stop rules, on the other hand, we use the following:\\
  163 PSTOP(h|left,...) = e.g. c(s,t,_h_) / c(s,t,h_) for certain s,t depending on adjacency \\
 164 <=>
 165 | 1/P_sent * e(s,t,_h_) * f(s,t,_h_) |
 166 | /                                  |
 167 | 1/P_sent * e(s,t, h_) * f(s,t, h_) |
 168
 169 A direct translation of h_->STOP h_ into L&Y formula (20) would give:
 170
 171 | 1/P_sent * prob(_ h_ -> STOP h_) * e(s,t,h_) * f(s,t,_h_) |
 172 | /                                                         |
 173 | 1/P_sent * e(s,t,_h_) * f(s,t,_h_)                        |
 174
 175 But we don't want that, since what we're really after is the
 176 "upside-down" probability of stopping when "generating upwards" in the
 177 PCFG tree, so just keep using c()/c() like we've been doing.
 178
 179 In stop rules, the prob() is PSTOP, while for
 180 attachment rules, it's PCHOOSE*(1-PSTOP).
 181 * DONE [#A] Implement P_CHOOSE formula.
 182   :PROPERTIES:
 183   :ARCHIVE_TIME: 2008-07-23 Wed 10:52
 184   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 185   :ARCHIVE_OLPATH: P_STOP and P_CHOOSE for IO/EM (reestimation)
 186   :ARCHIVE_CATEGORY: DMVCCM
 187   :ARCHIVE_TODO: TODO
 188   :END:
  189 Earlier I was assuming this, but it has to be changed into the above configurations:
 190
 191 | P_{CHOOSE}(a : h,R) = | \sum_{corpus} \sum_{s=loc(h)} \sum_{t > loc(h)} \sum_{loc(h) < r <= t} c(r,t,_a_)               |
 192 |                       | \sum_{corpus} \sum_{s=loc(h)} \sum_{t > loc(h)} \sum_{loc(h) < r <= t} c(s,t,h) * c(s, r-1, h_) |
 193 |                       |                                                                                                 |
 194 | P_{CHOOSE}(a : h,L) = | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} \sum_{r<loc(h)} c(s,r,_a_)                       |
 195 |                       | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} \sum_{r<loc(h)} c(s,t,h_) * c(r+1, t, h_)        |
  196 t >= loc(h) since there are many possibilities for right-attachments
 197 below, and each of them alone gives a lower probability (through
 198 multiplication) to the upper tree (so add them all)
 199
  200 The reason we have to check /both/ children of the attachments is that we
  201 have to make sure they are contiguous (otherwise we would have no way
  202 of ruling out e.g. h_->_b_, _b_->b_->_a_, where h_ covers *s* and *t*, _b_ is
  203 from *s* to *x<r* and _a_ is from *s* to *r*).
 204
 205 * DONE P_STOP formulas for various dir and adj:
 206    CLOSED: [2008-06-15 Sun 23:40]
 207   :PROPERTIES:
 208    :ARCHIVE_TIME: 2008-07-23 Wed 10:52
 209    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 210    :ARCHIVE_OLPATH: P_STOP and P_CHOOSE for IO/EM (reestimation)
 211    :ARCHIVE_CATEGORY: DMVCCM
 212    :ARCHIVE_TODO: DONE
 213   :END:
 214 Assuming this:
 215
 216 | P_{STOP}(STOP : h,L,non_adj) = | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} c(s,t,_h_) |
 217 |                                | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} c(s,t,h_)  |
 218 |                                |                                                           |
 219 | P_{STOP}(STOP : h,L,adj) =     | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>=loc(h)} c(s,t,_h_) |
 220 |                                | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>=loc(h)} c(s,t,h_)  |
 221 |                                |                                                           |
 222 | P_{STOP}(STOP : h,R,non_adj) = | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>loc(h)} c(s,t,h_)   |
 223 |                                | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>loc(h)} c(s,t,h)    |
 224 |                                |                                                           |
 225 | P_{STOP}(STOP : h,R,adj) =     | \sum_{corpus} \sum_{s=loc(h)} \sum_{t=loc(h)} c(s,t,h_)   |
 226 |                                | \sum_{corpus} \sum_{s=loc(h)} \sum_{t=loc(h)} c(s,t,h)    |
 227
 228 (And P_{STOP}(-STOP|...) = 1 - P_{STOP}(STOP|...) )
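
A sketch of the two left-stop rows of the table for a single sentence. It assumes the c() values are available in a dict keyed by (s, t, node), with 'SEAL' standing for _h_ and 'RGO_L' for h_. The dict layout is an assumption, and the sum over the corpus is left out.

```python
def p_stop_left(c, loc_h, n, adjacent):
    """P_STOP(STOP : h, L, adj or non_adj) for one sentence of length n.

    c maps (s, t, node) to an expected count; missing entries count as zero.
    """
    # adj: s == loc(h); non_adj: s < loc(h).  In both cases t >= loc(h).
    starts = [loc_h] if adjacent else range(0, loc_h)
    num = den = 0.0
    for s in starts:
        for t in range(loc_h, n):
            num += c.get((s, t, 'SEAL'), 0.0)
            den += c.get((s, t, 'RGO_L'), 0.0)
    return num / den if den > 0.0 else 0.0

# Toy counts for a two-word sentence with the head at position 1:
c_toy = {(1, 1, 'SEAL'): 0.3, (1, 1, 'RGO_L'): 0.6,
         (0, 1, 'SEAL'): 0.1, (0, 1, 'RGO_L'): 0.4}
p_adj = p_stop_left(c_toy, loc_h=1, n=2, adjacent=True)
p_non_adj = p_stop_left(c_toy, loc_h=1, n=2, adjacent=False)
```

The right-stop rows would follow the same pattern with 'RGO_L' over 'GO_R' and the ranges of the table.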
 229 * DONE COMMENT write out tex formulas for outer
 230   :PROPERTIES:
 231   :ARCHIVE_TIME: 2008-07-23 Wed 10:53
 232   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 233   :ARCHIVE_OLPATH: outer probabilities
 234   :ARCHIVE_CATEGORY: DMVCCM
 235   :END:
 236   [[file:tex/formulas.tex::P_%20OUTSIDE%20SEAL%20w%20i%20j%20P_%20STOP%20stop%20w%20left%20adj%20i][formulas.tex]]
 237 * DONE outer probabilities
 238   CLOSED: [2008-06-12 Thu 11:11]
 239   :PROPERTIES:
 240   :ARCHIVE_TIME: 2008-07-23 Wed 10:55
 241   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 242   :ARCHIVE_CATEGORY: DMVCCM
 243   :ARCHIVE_TODO: DONE
 244   :END:
 245 # <<outer>>
 246 See also [[http://www.student.uib.no/~kun041/dmvccm/tex/formulas.pdf][pdf of P_{OUTER}]], in the style of Klein's thesis appendix.
 247
 248 ** outer probabilities -- the algorithm
 249 When looping through the rules which rewrite to Node, there are 6
 250 different configurations, based on what the above (mother) node is,
 251 and what the Node for which we're computing is.
 252
 253 Here *r* is not between *s* and *t* as in inner(), but an /outer/ index. *loc_N*
 254 is the location of the Node head in the sentence, *loc_m* for the head
 255 of the mother of Node.
 256
 257 + mother is a RIGHT-stop:
 258   - outer(*s, t*, mother.LHS, *loc_N*), no inner-call
 259   - adjacent iff *t* == *loc_m*
 260 + mother is a  LEFT-stop:
 261   - outer(*s, t*, mother.LHS, *loc_N*), no inner-call
 262   - adjacent iff *s* == *loc_m*
 263
 264 + Node is on the LEFT branch (mother.L == Node)
 265   * and mother is a LEFT attachment:
 266     - *loc_N* will be in the LEFT branch, can be anything here.
 267     - In the RIGHT, non-attached, branch we find inner(*t+1, r*, mother.R,
 268       *loc_m*) for all possible *loc_m* in the right part of the sentence.
 269     - outer(*s, r*, mother.LHS, *loc_m*).
 270     - adjacent iff *t+1* == *loc_m*
 271   * and mother is a RIGHT attachment:
 272     - *loc_m* = *loc_N*.
 273     - In the RIGHT, attached, branch we find inner(*t+1, r*, mother.R, *loc_R*) for
 274       all possible *loc_R* in the right part of the sentence.
 275     - outer(*s, r*, mother.LHS, *loc_N*).
 276     - adjacent iff *t* == *loc_m*
 277
 278 + Node is on the RIGHT branch (mother.R == Node)
 279   * and mother is a LEFT attachment:
 280     - *loc_m* = *loc_N*.
 281     - In the LEFT, attached, branch we find inner(*r, s-1*, mother.L, *loc_L*) for
 282       all possible *loc_L* in the left part of the sentence.
 283     - outer(*r, t*, mother.LHS, *loc_m*).
 284     - adjacent iff *s* == *loc_m*
 285   * and mother is a RIGHT attachment:
 286     - *loc_N* will be in the RIGHT branch, can be anything here.
 287     - In the LEFT, non-attached, branch we find inner(*r, s-1*, mother.L, *loc_m*) for
 288       all possible *loc_m* in the left part of the sentence.
 289     - outer(*r, t*, mother.LHS, *loc_N*).
 290     - adjacent iff *s-1* == *loc_m*
 291
 292 [[file:outer_attachments.jpg]]
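
The adjacency conditions of the six configurations can be collected in one lookup. The configuration labels below are made up for this sketch; outer() in dmv.py need not be organized this way.

```python
def is_adjacent(config, s, t, loc_m):
    """Adjacency test for each of the six outer() configurations.

    config is either 'Rstop' or 'Lstop' (mother is a stop rule), or a pair
    (branch Node is on, attachment direction of mother).
    """
    tests = {
        'Rstop':          t == loc_m,
        'Lstop':          s == loc_m,
        ('L', 'Lattach'): t + 1 == loc_m,
        ('L', 'Rattach'): t == loc_m,
        ('R', 'Lattach'): s == loc_m,
        ('R', 'Rattach'): s - 1 == loc_m,
    }
    return tests[config]
```

For example, a Node spanning 0..1 on the left branch of a left attachment is adjacent only if the mother's head sits at position 2.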
 293
 294 : in notes:   in code (constants):    in Klein thesis:
 295 :-------------------------------------------------------------------
 296 : _h_         SEAL                    bar over h
 297 :  h_         RGO_L                   right-under-left-arrow over h
 298 :  h          GO_R                    right-arrow over h
 299 :
 300 :             LGO_R                   left-under-right-arrow over h
 301 :             GO_L                    left-arrow over h
 302
 303 Also, unlike in [[http://bibsonomy.org/bibtex/2b9f6798bb092697da7042ca3f5dee795][Lari & Young]], non-ROOT ('S') symbols may cover the
 304 whole sentence, but ROOT may /only/ appear if it covers the whole
 305 sentence.
 306
 307
 308 * DONE P_STOP and P_CHOOSE for IO/EM (reestimation)
 309   :PROPERTIES:
 310   :ARCHIVE_TIME: 2008-07-23 Wed 10:55
 311   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 312   :ARCHIVE_CATEGORY: DMVCCM
 313   :END:
 314 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
 315 Remember: The P_{STOP} formula is upside-down (left-to-right also).
  316 (In the article, not the [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][thesis]].)
 317 * DONE Separate initialization to another file?                       :PRETTIER:
 318    CLOSED: [2008-06-08 Sun 12:51]
 319   :PROPERTIES:
 320    :ARCHIVE_TIME: 2008-07-23 Wed 11:12
 321    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 322    :ARCHIVE_OLPATH: Initialization
 323    :ARCHIVE_CATEGORY: DMVCCM
 324    :ARCHIVE_TODO: DONE
 325   :END:
 326 [[file:src/harmonic.py::harmonic%20py%20initialization%20for%20dmv][harmonic.py]]
 327 * DONE DMV Initialization probabilities
 328   :PROPERTIES:
 329   :ARCHIVE_TIME: 2008-07-23 Wed 11:12
 330   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 331   :ARCHIVE_OLPATH: Initialization
 332   :ARCHIVE_CATEGORY: DMVCCM
 333   :ARCHIVE_TODO: DONE
 334   :END:
 335 (from initialization frequency)
 336 * DONE DMV Initialization frequencies
 337   CLOSED: [2008-05-27 Tue 20:04]
 338   :PROPERTIES:
 339   :ARCHIVE_TIME: 2008-07-23 Wed 11:12
 340   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 341   :ARCHIVE_OLPATH: Initialization
 342   :ARCHIVE_CATEGORY: DMVCCM
 343   :ARCHIVE_TODO: DONE
 344   :END:
 345 ** P_STOP
 346 P_{STOP} is not well defined by K&M. One possible interpretation given
 347 the sentence [det nn vb nn] is
 348 : f_{STOP}( STOP|det, L, adj) +1
 349 : f_{STOP}(-STOP|det, L, adj) +0
 350 : f_{STOP}( STOP|det, L, non_adj) +1
 351 : f_{STOP}(-STOP|det, L, non_adj) +0
 352 : f_{STOP}( STOP|det, R, adj) +0
 353 : f_{STOP}(-STOP|det, R, adj) +1
 354 :
 355 : f_{STOP}( STOP|nn, L, adj) +0
 356 : f_{STOP}(-STOP|nn, L, adj) +1
 357 : f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 358 : f_{STOP}(-STOP|nn, L, non_adj) +0
 359 *** TODO tweak
 360 # <<pstoptweak>>
 361 :            f[head,  'STOP', 'LN'] += (i_h <= 1)     # first two words
 362 :            f[head, '-STOP', 'LN'] += (not i_h <= 1)
 363 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 364 :            f[head, '-STOP', 'LA'] += (not i_h == 0)
 365 :            f[head,  'STOP', 'RN'] += (i_h >= n - 2) # last two words
 366 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
 367 :            f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 368 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 369 vs
 370 :            # this one requires some additional rewriting since it
 371 :            # introduces divisions by zero
 372 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 373 :            f[head, '-STOP', 'LN'] += (not i_h <= 1) # not first two
 374 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 375 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 376 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 377 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2) # not last two
 378 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 379 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 380 vs
 381 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 382 :            f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 383 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 384 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 385 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 386 :            f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 387 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 388 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 389 vs
 390 "all words take the same number of arguments" interpreted as
 391 :for all heads:
 392 :    p_STOP(head, 'STOP', 'LN') = 0.3
 393 :    p_STOP(head, 'STOP', 'LA') = 0.5
 394 :    p_STOP(head, 'STOP', 'RN') = 0.4
 395 :    p_STOP(head, 'STOP', 'RA') = 0.7
 396 (which we easily may tweak in init_zeros())
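
For comparison, the first variant above written out as a runnable function over a corpus of tag lists. The function names and the final ratio are illustrative assumptions; only the eight update lines come from the notes.

```python
from collections import defaultdict

def stop_frequencies(corpus):
    """Count STOP / -STOP events by head position (first variant)."""
    f = defaultdict(int)
    for sent in corpus:
        n = len(sent)
        for i_h, head in enumerate(sent):
            f[head,  'STOP', 'LN'] += (i_h <= 1)          # first two words
            f[head, '-STOP', 'LN'] += (not i_h <= 1)
            f[head,  'STOP', 'LA'] += (i_h == 0)          # very first word
            f[head, '-STOP', 'LA'] += (not i_h == 0)
            f[head,  'STOP', 'RN'] += (i_h >= n - 2)      # last two words
            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
            f[head,  'STOP', 'RA'] += (i_h == n - 1)      # very last word
            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
    return f

def stop_probability(f, head, dir_adj):
    """Relative frequency of STOP for one head and one of LN, LA, RN, RA."""
    stop, go = f[head, 'STOP', dir_adj], f[head, '-STOP', dir_adj]
    return float(stop) / (stop + go) if stop + go else 0.0

f = stop_frequencies(['det nn vb nn'.split()])
```

In this variant every occurrence of a head adds one to either STOP or -STOP in each of the four contexts, so the denominator is never zero for a tag that appears in the corpus.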
 397 ** P_CHOOSE
 398 Go through the corpus, counting distances between heads and
 399 arguments. In [det nn vb nn], we give
 400 - f_{CHOOSE}(nn|det, R) +1/1 + C
 401 - f_{CHOOSE}(vb|det, R) +1/2 + C
 402 - f_{CHOOSE}(nn|det, R) +1/3 + C
 403   - If this were the full corpus, P_{CHOOSE}(nn|det, R) would have
 404     (1+1/3+2C) / sum_a f_{CHOOSE}(a|det, R)
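
A sketch of this distance-based count, under the assumption that every other word in the sentence is a potential argument of each head. The function names and the key order (arg, head, direction) are choices made for the example.

```python
from collections import defaultdict

def choose_frequencies(corpus, C=0.0):
    """f_CHOOSE(arg | head, dir) += 1/distance + C for every head/argument pair."""
    f = defaultdict(float)
    for sent in corpus:
        for i_h, head in enumerate(sent):
            for i_a, arg in enumerate(sent):
                if i_a == i_h:
                    continue
                direction = 'L' if i_a < i_h else 'R'
                f[arg, head, direction] += 1.0 / abs(i_a - i_h) + C
    return f

def choose_probabilities(f):
    """Normalize the frequencies per (head, direction)."""
    totals = defaultdict(float)
    for (arg, head, direction), freq in f.items():
        totals[head, direction] += freq
    return dict((key, freq / totals[key[1], key[2]]) for key, freq in f.items())

f = choose_frequencies(['det nn vb nn'.split()], C=0.0)
p = choose_probabilities(f)
```

With C = 0 and this one-sentence corpus, the entry for (nn | det, R) is (1 + 1/3) / (1 + 1/2 + 1/3).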
 405
 406 The ROOT gets "each argument with equal probability", so in a sentence
 407 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 408 just a frequency count of the corpus...
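
One possible reading of the ROOT initialization, where each sentence spreads a total weight of 1 evenly over its words. This per-sentence weighting is an assumption; a raw tag count over the corpus would differ when sentence lengths vary.

```python
from collections import defaultdict

def root_probabilities(corpus):
    """P_ROOT(tag): each word of a sentence is an equally likely root."""
    f = defaultdict(float)
    for sent in corpus:
        for tag in sent:
            f[tag] += 1.0 / len(sent)
    total = sum(f.values())      # one unit of weight per non-empty sentence
    return dict((tag, freq / total) for tag, freq in f.items())

p_root = root_probabilities(['nn vb nn'.split()])
```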
 409
 410 In a sense there are no terminal probabilities, since an /h/ can only
 411 rewrite to an 'h' anyway (it's just a check for whether, at this
 412 location in the sentence, we have the right POS-tag).
  413 * DONE Expectation Maximization in IO/DMV-terms
 414   :PROPERTIES:
 415   :ARCHIVE_TIME: 2008-07-23 Wed 11:16
 416   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 417   :ARCHIVE_CATEGORY: DMVCCM
 418   :END:
  419 outer(i,j,Node) and inner(i,j,Node) calculate the expected number of
 420 trees (CNF-)headed by Node from =i= to =j= (sentence locations). This uses
 421 the P_STOP and P_CHOOSE values.
 422
 423 When re-estimating, we use the expected values from outer() and
 424 inner() to get new values for P_STOP and P_CHOOSE. When we've
 425 re-estimated for the entire corpus, we copy the new P_STOP and
 426 P_CHOOSE probabilities into our DMV_Grammar(), so that in the next
  427 round we use the new probN and probA to find outer- and
  428 inner-probabilities.
 429
 430 Since "adjacency" is not captured in regular CNF rules, we need two
  431 probabilities for each "rule", and outer() and inner() have to know when
 432 to use which.
 433
 434 * DONE [#A] Reestimate P_ROOT
 435   CLOSED: [2008-07-23 Wed 14:42]
 436   :PROPERTIES:
 437   :ARCHIVE_TIME: 2008-07-23 Wed 14:42
 438   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 439   :ARCHIVE_CATEGORY: DMVCCM
 440   :ARCHIVE_TODO: DONE
 441   :END:
 442 Should be easy, assuming my new [[file:tex/formulas.pdf][formula]] (section 4.4) is correct.
 443
 444 * DONE Make inner() and outer() also allow left-first attachment
 445    CLOSED: [2008-07-23 Wed 13:57]
 446   :PROPERTIES:
 447    :ARCHIVE_TIME: 2008-07-23 Wed 14:42
 448    :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 449    :ARCHIVE_OLPATH: Combine CCM with DMV
 450    :ARCHIVE_CATEGORY: DMVCCM
 451    :ARCHIVE_TODO: DONE
 452   :END:
 453 Using P_{ORDER}(/left-first/ | w) etc.
 454
 455 Time increased only from .8 to .11 sec, seems good.
 456 * DONE Alternate CNF-style rules
 457   :PROPERTIES:
 458   :ARCHIVE_TIME: 2008-07-30 Wed 00:39
 459   :ARCHIVE_FILE: ~/dmvccm/DMVCCM.org
 460   :ARCHIVE_OLPATH: Alternative CNF for DMV
 461   :ARCHIVE_CATEGORY: DMVCCM
 462   :END:
 463 :  h      Terminal
 464 :  h[RA]  Non-Terminal, attaching for the first time to the right
 465 :  h[RN]  Non-Terminal, attaching non-adjacently to the right
 466 :  h_[RA] Non-Terminal, stopping to the right adjacently
 467 :  h_[RN] Non-Terminal, stopping to the right non-adjacently
 468 :  h_[LA] Non-Terminal, attaching for the first time to the left
 469 :  h_[LN] Non-Terminal, attaching non-adjacently to the left
 470 : _h_[LA] Non-Terminal, stopping to the left adjacently
 471 : _h_[LN] Non-Terminal, stopping to the left non-adjacently
 472
 473 :   h[RA] -> h       _a_[LA]  # adjacent right attachment must go to "terminal"
 474 :   h[RA] -> h       _a_[LN]  # adjacent right attachment must go to "terminal"
 475 :
 476 :   h[RN] -> h[RA]   _a_[LA]  # already attached to right
 477 :   h[RN] -> h[RN]   _a_[LN]
 478 :
 479 :  h_[RA] -> h       STOP     # adjacent right stop must go to "terminal"
 480 :  h_[RN] -> h[RN]   STOP     # o/w non-adjacent
 481 :  h_[RN] -> h[RA]   STOP
 482 :
 483 :  h_[LA] -> _a_[LA] h_[RA]   # adjacent left attachment must
 484 :  h_[LA] -> _a_[LN] h_[RN]   # go to mothers of stop rules
 485 :
 486 :  h_[LN] -> _a_[LA] h_[LN]   # already attached to left
 487 :  h_[LN] -> _a_[LN] h_[LA]
 488 :
 489 : _h_[LA] -> STOP    h_[RA]   # adjacent left stop goes
 490 : _h_[LA] -> STOP    h_[RN]   # straight to a right stop
 491 :
 492 : _h_[LN] -> STOP    h_[LA]   # non-adjacent left stop
 493 : _h_[LN] -> STOP    h_[LN]   # goes to a left attachment rule
 494
 495 The reestimation function still has to sum over the various
 496 possibilities of N's and A's; but it seems to be simpler than the
 497 loc_h-method altogether.
 498
  499 One might reduce the number of rules a tiny bit, by having e.g. unary rules
 500 : _a_ -> _a_[LA]
 501 : _a_ -> _a_[LN]
 502 etc. (although that might just make it all more confusing)
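
To get a feel for the size of this rule set, the binary rules listed above can be enumerated for a given tag set. This is only a transcription of that table into (LHS, left, right) triples; the function name is made up and nothing like it is claimed to exist in the code.

```python
def alternate_cnf_rules(tags):
    """Enumerate the rules listed above as (LHS, left, right) triples."""
    rules = []
    for h in tags:
        # Stop rules: seven per head.
        rules += [
            ('%s_[RA]' % h, h, 'STOP'),
            ('%s_[RN]' % h, '%s[RN]' % h, 'STOP'),
            ('%s_[RN]' % h, '%s[RA]' % h, 'STOP'),
            ('_%s_[LA]' % h, 'STOP', '%s_[RA]' % h),
            ('_%s_[LA]' % h, 'STOP', '%s_[RN]' % h),
            ('_%s_[LN]' % h, 'STOP', '%s_[LA]' % h),
            ('_%s_[LN]' % h, 'STOP', '%s_[LN]' % h),
        ]
        for a in tags:
            # Attachment rules: eight per (head, argument) pair.
            rules += [
                ('%s[RA]' % h, h, '_%s_[LA]' % a),
                ('%s[RA]' % h, h, '_%s_[LN]' % a),
                ('%s[RN]' % h, '%s[RA]' % h, '_%s_[LA]' % a),
                ('%s[RN]' % h, '%s[RN]' % h, '_%s_[LN]' % a),
                ('%s_[LA]' % h, '_%s_[LA]' % a, '%s_[RA]' % h),
                ('%s_[LA]' % h, '_%s_[LN]' % a, '%s_[RN]' % h),
                ('%s_[LN]' % h, '_%s_[LA]' % a, '%s_[LN]' % h),
                ('%s_[LN]' % h, '_%s_[LN]' % a, '%s_[LA]' % h),
            ]
    return rules

rules = alternate_cnf_rules(['nn', 'vbd'])
```

For T tags this gives 7T stop rules plus 8T^2 attachment rules, so the attachment rules dominate quickly as the tag set grows.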
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516