# -*- coding: mule-utf-8-unix -*-
#+TAGS: OPTIMIZE PRETTIER
#+TITLE: DMV/CCM -- todo-list / progress
#+AUTHOR: Kevin Brubeck Unhammer
#+EMAIL: K.BrubeckUnhammer at student uva nl
#+SEQ_TODO: TOGROK TODO DONE

[[file:src/main.py][main.py]]
[[file:src/wsjdep.py][wsjdep.py]]
[[file:src/loc_h_dmv.py][loc_h_dmv.py]]
- fix stop and attachment formulas so they divide before summing
  - both in formulas.tex and in code
  - and check with other values for P_ORDER

[[file:DMVCCM.html][DMVCCM.html]]
* DMV/CCM report and project
DEADLINE: <2008-09-21 Sun>
- [[http://www.student.uib.no/~kun041/dmvccm/report.pdf][report.pdf]] -- draft report for the whole project, including formulas
  for the full algorithms
- [[file:src/main.py][main.py]] -- evaluation, corpus likelihoods
- [[file:src/wsjdep.py][wsjdep.py]] -- corpus reader for the dependency parsed WSJ
- [[file:src/loc_h_dmv.py][loc_h_dmv.py]] -- DMV-IO and reestimation
- [[file:src/loc_h_harmonic.py][loc_h_harmonic.py]] -- DMV initialization
- [[file:src/common_dmv.py][common_dmv.py]] -- various functions used by loc_h_dmv and others
- [[file:src/io.py][io.py]] -- non-DMV IO
- [[file:src/cnf_dmv.py][cnf_dmv.py]] -- cnf-like implementation of DMV
- [[file:src/cnf_harmonic.py][cnf_harmonic.py]] -- initialization for cnf_dmv

[[http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html][Archived entries]] from this file.
: old notes:  new notes:  in tex/code (constants):  in Klein thesis:
:---------------------------------------------------------------------------
: _h_         _h_         SEAL                      bar over h
: h_          h><         RGOL                      right-under-left-arrow over h
: h           h>          GOR                       right-arrow over h
:             ><h         LGOR                      left-under-right-arrow over h
:             <h          GOL                       left-arrow over h
These are represented in the code as pairs =(s_h,h)=, where =h= is an
integer (POS-tag) and =s_h= \in ={SEAL,RGOL,GOR,LGOR,GOL}=.
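For reference, a minimal sketch of how these pairs might look in
Python; the concrete integer values below are assumptions, not
necessarily those of common_dmv.py:
: # hypothetical seal-state constants -- the real values may differ
: GOR, RGOL, SEAL, LGOR, GOL = range(5)
:
: VBD = 2              # POS-tags are also just integers
: node = (SEAL, VBD)   # the node written _h_ above, with h = 'vbd'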
=P_ATTACH= and =P_CHOOSE= are synonymous; I try to use the former.
: P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir) * (1-P_STOP(STOP|h,dir,adj))
(precalculated after each reestimation with =g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)=)
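A minimal sketch of such a precalculation, assuming =p_STOP= and
=p_ATTACH= are dictionaries keyed as in the formula above; the key
layout in loc_h_dmv.py itself may well differ:
: def make_GO_AT(p_STOP, p_ATTACH):
:     """Precompute P_GO_AT(a|h,dir,adj) = P_ATTACH(a|h,dir) * (1 - P_STOP(h,dir,adj))."""
:     p_GO_AT = {}
:     for (a, h, d), p_attach in p_ATTACH.items():  # d is the direction
:         for adj in (True, False):                 # adjacent / non-adjacent
:             p_GO_AT[a, h, d, adj] = p_attach * (1 - p_STOP[h, d, adj])
:     return p_GO_AT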
** COMMENT qtrees, tex

\newcommand{\GOR}[1]{\overrightarrow{#1}}
\newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
\newcommand{\SEAL}[1]{\overline{#1}}
\newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
\newcommand{\GOL}[1]{\overleftarrow{#1}}
\Tree [.{$\RGOL{h}$} [.{$_s \SEAL{a} _t$\\
Node} ] {$_{t+1} \RGOL{h} _r$\\

\Tree [.{$\GOR{h}$} {$_{s} \GOR{h} _{t}$\\
Node} [.{$_{t+1} \SEAL{a} _r$\\

\Tree [.{$\RGOL{h}$} [.{$_r \SEAL{a} _{s-1}$\\
L} ] {$_{s} \RGOL{h} _t$\\

\Tree [.{$\GOR{h}$} {$_{r} \GOR{h} _{s-1}$\\
L} [.{$_{s} \SEAL{a} _t$\\

\Tree [.{$h\urcorner$} [.{$_s \ulcorner a\urcorner _t$\\
Node} ] {$_{t+1} h\urcorner _r$\\

\Tree [.{$h$} {$_{s} h _{t}$\\
Node} [.{$_{t+1} \ulcorner a\urcorner _r$\\

\Tree [.{$h\urcorner$} [.{$_r \ulcorner a\urcorner _{s-1}$\\
L} ] {$_{s} h\urcorner _t$\\

\Tree [.{$h$} {$_{r} h _{s-1}$\\
L} [.{$_{s} \ulcorner a\urcorner _t$\\
* Testing the dependency parsed WSJ
[[file:src/wsjdep.py][wsjdep.py]] uses NLTK (sort of) to get a dependency parsed version of
WSJ10 into the format used in mpp() in loc_h_dmv.py.

As a default, =WSJDepCorpusReader= looks for the file =wsj.combined.10.dep= in
the corpus root directory.

Only =sents()=, =tagged_sents()= and =parsed_sents()= (plus a new function
=tagonly_sents()=) are implemented; the other NLTK corpus functions are
not implemented.
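A usage sketch; the constructor argument below (a corpus root) is an
assumption based on how NLTK corpus readers are usually created, not
necessarily the actual signature in wsjdep.py:
: from wsjdep import WSJDepCorpusReader
:
: corpus = WSJDepCorpusReader("../corpus/wsjdep")  # hypothetical root
:
: tagged = corpus.tagged_sents()       # [[(word, tag), ...], ...]
: tagsonly = corpus.tagonly_sents()    # [[tag, ...], ...] -- training input
: gold_parses = corpus.parsed_sents()  # dependency structures, for evaluation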
** TODO [#A] Should =def evaluate= use add_root?
[[file:src/main.py::def%20evaluate%20g%20tagonly_sents%20parsed_sents][main.py]] evaluate
[[file:src/wsjdep.py][wsjdep.py]] add_root
(just has to count how many pairs are in there; Precision and Recall)
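A minimal sketch of that pair counting, assuming both the guessed and
the gold parse are available as sets of =(head, argument)= pairs
(whether the ROOT attachment goes into both sets is exactly the
add_root question):
: def pair_precision_recall(guessed, gold):
:     """Precision and recall over sets of (head, argument) pairs."""
:     correct = len(guessed & gold)
:     precision = correct / len(guessed) if guessed else 0.0
:     recall = correct / len(gold) if gold else 0.0
:     return precision, recall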
* TOGROK Combine CCM with DMV
Questions about the =P_COMBO= info in [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][Klein's thesis]]:
- Page 109 (pdf: 125): We have to premultiply "all our probabilities"
  by the CCM base product /\Pi_{<i,j>}
  P_{SPAN}(\alpha(i,j,s)|false)P_{CONTEXT}(\beta(i,j,s)|false)/; which
  probabilities are included under "all"? I'm assuming this includes
  =P_ATTACH=, since each time =P_ATTACH= is used, /\phi/ is multiplied in
  (pp. 110-111 ibid.); but /\phi/ is not used for STOPs, so should we not
  have our CCM product multiplied in there? How about =P_ROOT=?
  (Guessing =P_ORDER= is way out of the question...)
- For the outside probabilities, is it correct to assume we multiply
  in /\phi(j,k)/ or /\phi(k,i)/ when calculating =inner(i,j...)=? (E.g., only
  for the outside part, not for the whole range.) I don't understand
  the notation in =O()= on p. 103.
* TOGROK Reestimate P_ORDER?
* Most Probable Parse
** TOGROK Find MPP with CCM
** DONE Find Most Probable Parse of a given test sentence, in DMV
CLOSED: [2008-07-23 Wed 10:56]
inner() optionally keeps track of the highest-probability children of
any node in =mpptree=. Say we're looking for =inner(i,j,(s_h,h),loc_h)= in
a certain sentence, and we find some possible left and right children;
we then add to =mpptree[i,j,(s_h,h),loc_h]= the triple =(p, L, R)=, where =L= and
=R= are of the same form as the key (=i,j,(s_h,h),loc_h=) and =p= is the
probability of this node rewriting to =L= and =R=,
e.g. =inner(L)*inner(R)*p_GO_AT= or =p_STOP= or whatever. We only add this
entry to =mpptree= if there wasn't already a higher-probability entry there.
Then, after =inner_sent= makes an =mpptree=, we find the /relevant/
head-argument pairs by searching through the tree using a queue,
adding the =L= and =R= keys of any entry to the queue as we find them
(skipping =STOP= keys), and adding any attachment entries to a set of
triples =(head,argument,dir)=. Thus we have our most probable parse, e.g.
: set([( ROOT,   (vbd,2), RIGHT),
:      ((vbd,2), (nn,1),  LEFT),
:      ((vbd,2), (nn,3),  RIGHT),
:      ((nn,1),  (det,0), LEFT)])
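A sketch of that queue-based extraction, assuming =mpptree= maps each
key to its best =(p, L, R)= triple and that keys are
=(i, j, (s_h,h), loc_h)= tuples as described above; treating any child
with a different head as an attachment is an assumption, not
loc_h_dmv.py's actual bookkeeping:
: from collections import deque
:
: STOP = 'STOP'   # assumed sentinel for STOP keys
:
: def mpp_pairs(mpptree, root_key):
:     """Collect (head, argument, dir) triples of the most probable parse."""
:     def head(key):
:         i, j, (s_h, h), loc_h = key
:         return (h, loc_h)
:
:     pairs = set()
:     queue = deque([root_key])
:     while queue:
:         key = queue.popleft()
:         if key not in mpptree:                 # terminals: nothing below
:             continue
:         p, L, R = mpptree[key]
:         for child, direction in ((L, 'LEFT'), (R, 'RIGHT')):
:             if child is None or child == STOP:  # skip STOP keys
:                 continue
:             queue.append(child)
:             if head(child) != head(key):        # an attachment entry
:                 pairs.add((head(key), head(child), direction))
:     return pairs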
* Initialization
[[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]

We go through the corpus, since the probabilities are based on how far
away in the sentence arguments are from their heads.
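For illustration only: Klein-style harmonic initialization gives each
potential head-argument pair a soft count that decays with the
distance between them. The constant and the exact weighting below are
assumptions, not necessarily what loc_h_harmonic.py does:
: from collections import defaultdict
:
: def harmonic_attach_counts(tagonly_sents, C=1.0):
:     """Soft counts for P_ATTACH(a|h,dir), weighted by 1/(distance + C)."""
:     counts = defaultdict(float)
:     for sent in tagonly_sents:
:         for loc_h, h in enumerate(sent):
:             for loc_a, a in enumerate(sent):
:                 if loc_a == loc_h:
:                     continue
:                 d = 'RIGHT' if loc_a > loc_h else 'LEFT'
:                 counts[h, d, a] += 1.0 / (abs(loc_a - loc_h) + C)
:     return counts  # normalize per (h, d) to get the initial P_ATTACH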
** TOGROK CCM Initialization
P_{SPLIT} used here... how, again?
* TODO [#C] Alternative CNF for DMV
- [[file:src/cnf_dmv.py][cnf_dmv.py]]
- [[file:src/cnf_harmonic.py][cnf_harmonic.py]]

See section 5 of [[file:tex/formulas.pdf][formulas.pdf]].

Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
: >>> print testgrammar_h():
: h><  --> h>   STOP  [0.30]
: h><  --> >h>  STOP  [0.40]
: _h_  --> STOP h><   [1.00]
: _h_  --> STOP <h><  [1.00]
: >h>  --> h>   _h_   [1.00]
: >h>  --> >h>  _h_   [1.00]
: <h>< --> _h_  h><   [0.70]
: <h>< --> _h_  <h><  [0.60]
: ROOT --> STOP _h_   [1.00]
** TODO [#A] Make and implement an equivalent grammar that's /pure/ CNF
...since I'm not sure about my unary reestimation rules (section 5 of
[[file:tex/formulas.pdf][formulas]]).

For any rule where the LHS is =_h_=, we also have a corresponding one with
LHS =ROOT=, the only difference being that we multiply in =p_ROOT(h)=.

For any rule where the LHS is =.h>=, we use adjacent probabilities for the
left child; if the LHS is =<h.=, we use adjacent probabilities for the right
child. Only =_h_= and =_h>_= (plus =ROOT=) get to introduce the pre-terminal
=h= (where =h=, =ROOT= and =_h_= all rewrite to the terminal ='h'=), and
only =_h_= and =_h>_= (plus =ROOT=) act as STOP rules (i.e. they get to
multiply in =p(STOP)=).
: _h_  --> 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj)
: ROOT --> 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj) * p_ROOT(h)

: _h_  --> h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
: _h_  --> h .h>     p(STOP|h,L,adj) * p(STOP|h,R,non)
: .h>  --> _a_ _b_   p(a|h,R)*p(-STOP|h,R,adj) * p(b|h,R)*p(-STOP|h,R,non)
: .h>  --> _a_ h>    p(a|h,R)*p(-STOP|h,R,adj)
: h>   --> _a_ _b_   p(a|h,R)*p(-STOP|h,R,non) * p(b|h,R)*p(-STOP|h,R,non)
: h>   --> _a_ h>    p(a|h,R)*p(-STOP|h,R,non)

: _h_  --> _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj)
: _h_  --> <h. h     p(STOP|h,L,non) * p(STOP|h,R,adj)
: <h.  --> _b_ _a_   p(b|h,L)*p(-STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
: <h.  --> <h _a_    p(a|h,L)*p(-STOP|h,L,adj)
: <h   --> _a_ _b_   p(a|h,L)*p(-STOP|h,L,non) * p(b|h,L)*p(-STOP|h,L,non)
: <h   --> <h _a_    p(a|h,L)*p(-STOP|h,L,non)

: _h_  --> <h. _h>_  p(STOP|h,L,non)
: _h_  --> _a_ _h>_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
: _h>_ --> h .h>     p(STOP|h,R,non)
: _h>_ --> h _a_     p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)

: ROOT --> h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj) * p_ROOT(h)
: ROOT --> h .h>     p(STOP|h,L,adj) * p(STOP|h,R,non) * p_ROOT(h)

: ROOT --> _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
: ROOT --> <h. h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p_ROOT(h)

: ROOT --> <h. _h>_  p(STOP|h,L,non) * p_ROOT(h)
: ROOT --> _a_ _h>_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
Since we have rules rewriting =h= to =a= and =b=, the rule set numbers
more than n_{tags}^{2} rules.

** TOGROK [#A] convert L&Y-based reestimation into P_ATTACH and P_STOP values
Sum over the various rules? Or something? Must think about this.
** TODO [#C] move as much as possible into common_dmv.py
[[file:src/common_dmv.py][common_dmv.py]]
** DONE L&Y-based reestimation for cnf_dmv
CLOSED: [2008-08-21 Thu 16:35]
** DONE dmv2cnf re-estimation formulas
CLOSED: [2008-08-21 Thu 16:36]
** DONE inner and outer for cnf_dmv.py, also cnf_harmonic.py
http://wiki.python.org/moin/PythonSpeed/PerformanceTips -- e.g., use
map/reduce/filter/[i for i in ...]/(i for i in ...) instead of
for-loops, and use local variables for globals (global variables or
module-level lookups) inside inner loops.
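A small illustration of those two tips (a list comprehension instead
of an explicit loop, and caching a global lookup in a local variable);
the names and numbers are just for demonstration:
: import math
:
: data = range(10000)
:
: # explicit for-loop:
: roots = []
: for x in data:
:     roots.append(math.sqrt(x))
:
: # list comprehension, with the global lookup cached in a local --
: # both changes tend to shave time off tight loops:
: sqrt = math.sqrt
: roots = [sqrt(x) for x in data]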
** TODO Clean up reestimation code :PRETTIER:
** TODO [#A] compare speed of w_left/right(...) and w(LEFT/RIGHT, ...) :OPTIMIZE:
** TODO when reestimating P_STOP etc, remove rules with p < epsilon :OPTIMIZE:
** TODO inner_dmv, short ranges and impossible attachment :OPTIMIZE:
If t-s <= 2, there can be only one attachment below, so don't recurse
with both Lattach=True and Rattach=True.

If t-s <= 1, there can be no attachment below, so only recurse with
Lattach=False, Rattach=False.

Put this in the loop under the rewrite rules (we could also do it in the STOP
section, but that would only have an effect on very short sentences).
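A sketch of that pruning as a helper, assuming fencepost spans where
=t - s= is the number of words covered; the real loop in loc_h_dmv.py
may be organized differently:
: def attachment_flags(s, t):
:     """Which (Lattach, Rattach) combinations are worth recursing on for span [s,t)."""
:     width = t - s
:     if width <= 1:   # a single word: no attachment below
:         return [(False, False)]
:     if width <= 2:   # two words: at most one attachment below
:         return [(False, False), (True, False), (False, True)]
:     return [(False, False), (True, False), (False, True), (True, True)]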
** TODO clean up the module files :PRETTIER:
Is there a better way to divide dmv and harmonic? There's a two-way
dependency between the modules. Guess there could be a third file that
imports both the initialization and the actual EM stuff, while a file
containing constants and classes could be imported by all the others:
: dmv.py imports dmv_EM.py imports dmv_classes.py
: dmv.py imports dmv_inits.py imports dmv_classes.py
** TOGROK Some (tagged) sentences are bound to come twice :OPTIMIZE:
E.g., first sort and count, so that the corpus
: [['nn','vbd','det','nn'],
:  ['vbd','nn','det','nn'],
:  ['nn','vbd','det','nn']]
becomes
: [(['nn','vbd','det','nn'],2),
:  (['vbd','nn','det','nn'],1)]
and then, in each loop through the sentences, handle each distinct
sentence only once, scaling its counts by the number of occurrences.

Is there much to gain here?
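A minimal sketch of that sort-and-count step using the standard
library; the tag lists are turned into tuples since lists aren't
hashable:
: from collections import Counter
:
: corpus = [['nn','vbd','det','nn'],
:           ['vbd','nn','det','nn'],
:           ['nn','vbd','det','nn']]
:
: counted = Counter(tuple(sent) for sent in corpus)
:
: for sent, n in counted.items():
:     # run inner_sent / reestimation once per distinct sentence,
:     # scaling the resulting counts by n
:     pass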
** TOGROK tags as numbers or tags as strings? :OPTIMIZE:
Need to clean up the representation.

Stick with tag-strings in initialization, then switch to numbers for
the IO-algorithm, perhaps? We can probably afford more string-matching
in initialization.
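A sketch of one way to do that switch: build an index over the tag
strings after initialization and convert each sentence to a tuple of
integers before the IO-algorithm (the names are illustrative):
: def make_tag_index(tagonly_sents):
:     """Map each tag string to a small integer, and back."""
:     tags = sorted({tag for sent in tagonly_sents for tag in sent})
:     tag2num = {tag: i for i, tag in enumerate(tags)}
:     num2tag = dict(enumerate(tags))
:     return tag2num, num2tag
:
: # convert once, before the repeated inside-outside passes:
: # numbered = [tuple(tag2num[t] for t in sent) for sent in tagonly_sents]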
* Adjacency and combining it with the inside-outside algorithm
Each DMV probability (for a certain PCFG node) has both an adjacent
and a non-adjacent version; inner() and outer() need to pick the
correct one in each case.

In each inner() call, loc_h is the location of the head of this
dependency structure. In each outer() call, it's the head of the /Node/,
the structure we're looking outside of.

We call inner() for each location of a head, and on each terminal,
loc_h must equal =i= (and =loc_h+1= equal =j=). In the recursive attachment
calls, we use the locations (sentence indices) of words to the left or
right of the head in calls to inner(). /loc_h lets us check whether we
need probN or probA/.
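A minimal sketch of that check, assuming fencepost indices where a
terminal head covers =[loc_h, loc_h+1)=; whether this matches the
exact convention in loc_h_dmv.py is an assumption:
: def stop_prob(probA, probN, h, direction, i, j, loc_h):
:     """Pick the adjacent (probA) or non-adjacent (probN) stop probability.
:
:     The head is adjacent in a direction if nothing has been attached on
:     that side yet, i.e. the span still ends at the head's own position.
:     """
:     if direction == 'LEFT':
:         adjacent = (i == loc_h)
:     else:  # 'RIGHT'
:         adjacent = (j == loc_h + 1)
:     table = probA if adjacent else probN
:     return table[h, direction]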
** Possible alternate type of adjacency
K&M's adjacency is just whether or not an argument has been generated
in the current direction yet. One could also make a stronger type of
adjacency, where h and a are not adjacent if b is in between; e.g., with
the sentence "a b h" and the structure ((h->a), (a->b)), h is
K&M-adjacent to a, but not next to a, since b is in between. It's easy
to check this type of adjacency in inner(), but it needs new
reestimation rules.
* COMMENT Emacs
Make those debug statements steal a bit less attention in Emacs:
:(font-lock-add-keywords
: 'python-mode ; not really regexp, a bit slow
: '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
:\\1 .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
:(font-lock-add-keywords
: 'python-mode
: '(("\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
* Python links
- [[file:src/pseudo.py][pseudo.py]]
- http://nltk.org/doc/en/structured-programming.html -- recursive dynamic programming
- http://nltk.org/doc/en/advanced-parsing.html
- http://jaynes.colorado.edu/PythonIdioms.html
* Git
Repository web page: http://repo.or.cz/w/dmvccm.git

Setting up a new project:
: git init
: git add .
: git commit -m "first release"
Later on: (=-a= does =git rm= and =git add= automatically)
: git commit -a -m "some subsequent release"

Then push stuff up to the remote server:
: git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
(=eval `ssh-agent`= and =ssh-add= to avoid having to type in the
passphrase all the time.)

Make a copy of the (remote) master branch:
: git clone git://repo.or.cz/dmvccm.git
Make and name a new branch in this folder:
: git checkout -b mybranch
To save changes in =mybranch=:
: git commit -a
Go back to the master branch (uncommitted changes from =mybranch= are
carried along):
: git checkout master
To stage only parts of a file:
: git add --interactive
http://www-cs-students.stanford.edu/~blynn//gitmagic/