# -*- coding: mule-utf-8-unix -*-
#+TAGS: OPTIMIZE PRETTIER
#+TITLE: DMV/CCM -- todo-list / progress
#+AUTHOR: Kevin Brubeck Unhammer
#+EMAIL: K.BrubeckUnhammer at student uva nl
#+SEQ_TODO: TOGROK TODO DONE

[[file:src/main.py][main.py]]
[[file:src/wsjdep.py][wsjdep.py]]
[[file:src/loc_h_dmv.py][loc_h_dmv.py]]
- fix stop and attachment formulas so they divide before summing
  - both in formulas.tex and in code
  - and check with other values for P_ORDER

[[file:DMVCCM.html][DMVCCM.html]]
* DMV/CCM report and project
DEADLINE: <2008-09-21 Sun>
- [[http://www.student.uib.no/~kun041/dmvccm/report.pdf][report.pdf]] -- draft report for the whole project, including formulas
  for the full algorithms
- [[file:src/main.py][main.py]] -- evaluation, corpus likelihoods
- [[file:src/wsjdep.py][wsjdep.py]] -- corpus reader for the dependency parsed WSJ
- [[file:src/loc_h_dmv.py][loc_h_dmv.py]] -- DMV-IO and reestimation
- [[file:src/loc_h_harmonic.py][loc_h_harmonic.py]] -- DMV initialization
- [[file:src/common_dmv.py][common_dmv.py]] -- various functions used by loc_h_dmv and others
- [[file:src/io.py][io.py]] -- non-DMV IO
- [[file:src/cnf_dmv.py][cnf_dmv.py]] -- cnf-like implementation of DMV
- [[file:src/cnf_harmonic.py][cnf_harmonic.py]] -- initialization for cnf_dmv

[[http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html][Archived entries]] from this file.
: old notes:  new notes:  in tex/code (constants):  in Klein thesis:
:---------------------------------------------------------------------------
: _h_         _h_         SEAL                      bar over h
: h_          h><         RGOL                      right-under-left-arrow over h
: h           h>          GOR                       right-arrow over h
:             ><h         LGOR                      left-under-right-arrow over h
:             <h          GOL                       left-arrow over h
These are represented in the code as pairs =(s_h,h)=, where =h= is an
integer (POS-tag) and =s_h= \in ={SEAL,RGOL,GOR,LGOR,GOL}=.
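For reference, a minimal sketch of how these pairs might look in
Python; the concrete integer values below are assumptions, not
necessarily those of common_dmv.py:
: # hypothetical seal-state constants -- the real values may differ
: GOR, RGOL, SEAL, LGOR, GOL = range(5)
:
: VBD = 2              # POS-tags are also just integers
: node = (SEAL, VBD)   # the node written _h_ above, with h = 'vbd'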
=P_ATTACH= and =P_CHOOSE= are synonymous; I try to use the former.
: P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir) * (1-P_STOP(STOP|h,dir,adj))
(precalculated after each reestimation with =g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)=)
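A minimal sketch of such a precalculation, assuming =p_STOP= and
=p_ATTACH= are dictionaries keyed as in the formula above; the key
layout in loc_h_dmv.py itself may well differ:
: def make_GO_AT(p_STOP, p_ATTACH):
:     """Precompute P_GO_AT(a|h,dir,adj) = P_ATTACH(a|h,dir) * (1 - P_STOP(h,dir,adj))."""
:     p_GO_AT = {}
:     for (a, h, d), p_attach in p_ATTACH.items():  # d is the direction
:         for adj in (True, False):                 # adjacent / non-adjacent
:             p_GO_AT[a, h, d, adj] = p_attach * (1 - p_STOP[h, d, adj])
:     return p_GO_AT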
** COMMENT qtrees, tex

\newcommand{\GOR}[1]{\overrightarrow{#1}}
\newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
\newcommand{\SEAL}[1]{\overline{#1}}
\newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
\newcommand{\GOL}[1]{\overleftarrow{#1}}
\Tree [.{$\RGOL{h}$} [.{$_s \SEAL{a} _t$\\
Node} ] {$_{t+1} \RGOL{h} _r$\\

\Tree [.{$\GOR{h}$} {$_{s} \GOR{h} _{t}$\\
Node} [.{$_{t+1} \SEAL{a} _r$\\

\Tree [.{$\RGOL{h}$} [.{$_r \SEAL{a} _{s-1}$\\
L} ] {$_{s} \RGOL{h} _t$\\

\Tree [.{$\GOR{h}$} {$_{r} \GOR{h} _{s-1}$\\
L} [.{$_{s} \SEAL{a} _t$\\

\Tree [.{$h\urcorner$} [.{$_s \ulcorner a\urcorner _t$\\
Node} ] {$_{t+1} h\urcorner _r$\\

\Tree [.{$h$} {$_{s} h _{t}$\\
Node} [.{$_{t+1} \ulcorner a\urcorner _r$\\

\Tree [.{$h\urcorner$} [.{$_r \ulcorner a\urcorner _{s-1}$\\
L} ] {$_{s} h\urcorner _t$\\

\Tree [.{$h$} {$_{r} h _{s-1}$\\
L} [.{$_{s} \ulcorner a\urcorner _t$\\
* Testing the dependency parsed WSJ
[[file:src/wsjdep.py][wsjdep.py]] uses NLTK (sort of) to get a dependency parsed version of
WSJ10 into the format used in mpp() in loc_h_dmv.py.

As a default, =WSJDepCorpusReader= looks for the file =wsj.combined.10.dep= in
the corpus root directory.

Only =sents()=, =tagged_sents()= and =parsed_sents()= (plus a new function
=tagonly_sents()=) are implemented; the other NLTK corpus functions are
not implemented.
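A usage sketch; the constructor argument below (a corpus root) is an
assumption based on how NLTK corpus readers are usually created, not
necessarily the actual signature in wsjdep.py:
: from wsjdep import WSJDepCorpusReader
:
: corpus = WSJDepCorpusReader("../corpus/wsjdep")  # hypothetical root
:
: tagged = corpus.tagged_sents()       # [[(word, tag), ...], ...]
: tagsonly = corpus.tagonly_sents()    # [[tag, ...], ...] -- training input
: gold_parses = corpus.parsed_sents()  # dependency structures, for evaluation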
** TODO [#A] Should =def evaluate= use add_root?
[[file:src/main.py::def%20evaluate%20g%20tagonly_sents%20parsed_sents][main.py]] evaluate
[[file:src/wsjdep.py][wsjdep.py]] add_root
(just has to count how many pairs are in there; Precision and Recall)
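A minimal sketch of that pair counting, assuming both the guessed and
the gold parse are available as sets of =(head, argument)= pairs
(whether the ROOT attachment goes into both sets is exactly the
add_root question):
: def pair_precision_recall(guessed, gold):
:     """Precision and recall over sets of (head, argument) pairs."""
:     correct = len(guessed & gold)
:     precision = correct / len(guessed) if guessed else 0.0
:     recall = correct / len(gold) if gold else 0.0
:     return precision, recall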
* TOGROK Combine CCM with DMV
Questions about the =P_COMBO= info in [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][Klein's thesis]]:
- Page 109 (pdf: 125): We have to premultiply "all our probabilities"
  by the CCM base product /\Pi_{<i,j>}
  P_{SPAN}(\alpha(i,j,s)|false)P_{CONTEXT}(\beta(i,j,s)|false)/; which
  probabilities are included under "all"? I'm assuming this includes
  =P_ATTACH=, since each time =P_ATTACH= is used, /\phi/ is multiplied in
  (pp. 110-111 ibid.); but /\phi/ is not used for STOPs, so should we not
  have our CCM product multiplied in there? How about =P_ROOT=?
  (Guessing =P_ORDER= is way out of the question...)
- For the outside probabilities, is it correct to assume we multiply
  in /\phi(j,k)/ or /\phi(k,i)/ when calculating =inner(i,j...)=? (E.g., only
  for the outside part, not for the whole range.) I don't understand
  the notation in =O()= on p. 103.
* TOGROK Reestimate P_ORDER?
* Most Probable Parse
** TOGROK Find MPP with CCM
** DONE Find Most Probable Parse of a given test sentence, in DMV
CLOSED: [2008-07-23 Wed 10:56]
inner() optionally keeps track of the highest-probability children of
any node in =mpptree=. Say we're looking for =inner(i,j,(s_h,h),loc_h)= in
a certain sentence, and we find some possible left and right children;
we then add to =mpptree[i,j,(s_h,h),loc_h]= the triple =(p, L, R)=, where =L= and
=R= are of the same form as the key (=i,j,(s_h,h),loc_h=) and =p= is the
probability of this node rewriting to =L= and =R=,
e.g. =inner(L)*inner(R)*p_GO_AT= or =p_STOP= or whatever. We only add this
entry to =mpptree= if there wasn't already a higher-probability entry there.
Then, after =inner_sent= makes an =mpptree=, we find the /relevant/
head-argument pairs by searching through the tree using a queue,
adding the =L= and =R= keys of any entry to the queue as we find them
(skipping =STOP= keys), and adding any attachment entries to a set of
triples =(head,argument,dir)=. Thus we have our most probable parse, e.g.
: set([( ROOT,   (vbd,2), RIGHT),
:      ((vbd,2), (nn,1),  LEFT),
:      ((vbd,2), (nn,3),  RIGHT),
:      ((nn,1),  (det,0), LEFT)])
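A sketch of that queue-based extraction, assuming =mpptree= maps each
key to its best =(p, L, R)= triple and that keys are
=(i, j, (s_h,h), loc_h)= tuples as described above; treating any child
with a different head as an attachment is an assumption, not
loc_h_dmv.py's actual bookkeeping:
: from collections import deque
:
: STOP = 'STOP'   # assumed sentinel for STOP keys
:
: def mpp_pairs(mpptree, root_key):
:     """Collect (head, argument, dir) triples of the most probable parse."""
:     def head(key):
:         i, j, (s_h, h), loc_h = key
:         return (h, loc_h)
:
:     pairs = set()
:     queue = deque([root_key])
:     while queue:
:         key = queue.popleft()
:         if key not in mpptree:                 # terminals: nothing below
:             continue
:         p, L, R = mpptree[key]
:         for child, direction in ((L, 'LEFT'), (R, 'RIGHT')):
:             if child is None or child == STOP:  # skip STOP keys
:                 continue
:             queue.append(child)
:             if head(child) != head(key):        # an attachment entry
:                 pairs.add((head(key), head(child), direction))
:     return pairs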
* Initialization
[[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]

We go through the corpus, since the probabilities are based on how far
away in the sentence arguments are from their heads.
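For illustration only: Klein-style harmonic initialization gives each
potential head-argument pair a soft count that decays with the
distance between them. The constant and the exact weighting below are
assumptions, not necessarily what loc_h_harmonic.py does:
: from collections import defaultdict
:
: def harmonic_attach_counts(tagonly_sents, C=1.0):
:     """Soft counts for P_ATTACH(a|h,dir), weighted by 1/(distance + C)."""
:     counts = defaultdict(float)
:     for sent in tagonly_sents:
:         for loc_h, h in enumerate(sent):
:             for loc_a, a in enumerate(sent):
:                 if loc_a == loc_h:
:                     continue
:                 d = 'RIGHT' if loc_a > loc_h else 'LEFT'
:                 counts[h, d, a] += 1.0 / (abs(loc_a - loc_h) + C)
:     return counts  # normalize per (h, d) to get the initial P_ATTACH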
** TOGROK CCM Initialization
P_{SPLIT} used here... how, again?
* TODO [#C] Alternative CNF for DMV
- [[file:src/cnf_dmv.py][cnf_dmv.py]]
- [[file:src/cnf_harmonic.py][cnf_harmonic.py]]

See section 5 of [[file:tex/formulas.pdf][formulas.pdf]].

Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
: >>> print testgrammar_h():
: h><  --> h>   STOP  [0.30]
: h><  --> >h>  STOP  [0.40]
: _h_  --> STOP h><   [1.00]
: _h_  --> STOP <h><  [1.00]
: >h>  --> h>   _h_   [1.00]
: >h>  --> >h>  _h_   [1.00]
: <h>< --> _h_  h><   [0.70]
: <h>< --> _h_  <h><  [0.60]
: ROOT --> STOP _h_   [1.00]
** TODO [#A] Make and implement an equivalent grammar that's /pure/ CNF
...since I'm not sure about my unary reestimation rules (section 5 of
[[file:tex/formulas.pdf][formulas]]).

For any rule where the LHS is =_h_=, we also have a corresponding one with
LHS =ROOT=, the only difference being that we multiply in =p_ROOT(h)=.

For any rule where the LHS is =.h>=, we use adjacent probabilities for the
left child; if the LHS is =<h.=, we use adjacent probabilities for the right
child. Only =_h_= and =_h>_= (plus =ROOT=) get to introduce the pre-terminal
=h= (where =h=, =ROOT= and =_h_= all rewrite to the terminal ='h'=), and
only =_h_= and =_h>_= (plus =ROOT=) act as STOP rules (i.e. they get to
multiply in =p(STOP)=).
: _h_  --> 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj)
: ROOT --> 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj) * p_ROOT(h)

: _h_  --> h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
: _h_  --> h .h>     p(STOP|h,L,adj) * p(STOP|h,R,non)
: .h>  --> _a_ _b_   p(a|h,R)*p(-STOP|h,R,adj) * p(b|h,R)*p(-STOP|h,R,non)
: .h>  --> _a_ h>    p(a|h,R)*p(-STOP|h,R,adj)
: h>   --> _a_ _b_   p(a|h,R)*p(-STOP|h,R,non) * p(b|h,R)*p(-STOP|h,R,non)
: h>   --> _a_ h>    p(a|h,R)*p(-STOP|h,R,non)

: _h_  --> _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj)
: _h_  --> <h. h     p(STOP|h,L,non) * p(STOP|h,R,adj)
: <h.  --> _b_ _a_   p(b|h,L)*p(-STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
: <h.  --> <h _a_    p(a|h,L)*p(-STOP|h,L,adj)
: <h   --> _a_ _b_   p(a|h,L)*p(-STOP|h,L,non) * p(b|h,L)*p(-STOP|h,L,non)
: <h   --> <h _a_    p(a|h,L)*p(-STOP|h,L,non)

: _h_  --> <h. _h>_  p(STOP|h,L,non)
: _h_  --> _a_ _h>_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
: _h>_ --> h .h>     p(STOP|h,R,non)
: _h>_ --> h _a_     p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)

: ROOT --> h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj) * p_ROOT(h)
: ROOT --> h .h>     p(STOP|h,L,adj) * p(STOP|h,R,non) * p_ROOT(h)

: ROOT --> _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
: ROOT --> <h. h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p_ROOT(h)

: ROOT --> <h. _h>_  p(STOP|h,L,non) * p_ROOT(h)
: ROOT --> _a_ _h>_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
Since we have rules rewriting =h= to =a= and =b=, the rule set numbers
more than n_{tags}^{2} rules.

** TOGROK [#A] convert L&Y-based reestimation into P_ATTACH and P_STOP values
Sum over the various rules? Or something? Must think about this.
** TODO [#C] move as much as possible into common_dmv.py
[[file:src/common_dmv.py][common_dmv.py]]
** DONE L&Y-based reestimation for cnf_dmv
CLOSED: [2008-08-21 Thu 16:35]
** DONE dmv2cnf re-estimation formulas
CLOSED: [2008-08-21 Thu 16:36]
** DONE inner and outer for cnf_dmv.py, also cnf_harmonic.py
http://wiki.python.org/moin/PythonSpeed/PerformanceTips -- e.g., use
map/reduce/filter/[i for i in ...]/(i for i in ...) instead of
for-loops, and use local variables for globals (global variables or
module-level lookups) inside inner loops.
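A small illustration of those two tips (a list comprehension instead
of an explicit loop, and caching a global lookup in a local variable);
the names and numbers are just for demonstration:
: import math
:
: data = range(10000)
:
: # explicit for-loop:
: roots = []
: for x in data:
:     roots.append(math.sqrt(x))
:
: # list comprehension, with the global lookup cached in a local --
: # both changes tend to shave time off tight loops:
: sqrt = math.sqrt
: roots = [sqrt(x) for x in data]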
** TODO Clean up reestimation code :PRETTIER:
** TODO [#A] compare speed of w_left/right(...) and w(LEFT/RIGHT, ...) :OPTIMIZE:
** TODO when reestimating P_STOP etc, remove rules with p < epsilon :OPTIMIZE:
** TODO inner_dmv, short ranges and impossible attachment :OPTIMIZE:
If t-s <= 2, there can be only one attachment below, so don't recurse
with both Lattach=True and Rattach=True.

If t-s <= 1, there can be no attachment below, so only recurse with
Lattach=False, Rattach=False.

Put this in the loop under the rewrite rules (we could also do it in the STOP
section, but that would only have an effect on very short sentences).
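A sketch of that pruning as a helper, assuming fencepost spans where
=t - s= is the number of words covered; the real loop in loc_h_dmv.py
may be organized differently:
: def attachment_flags(s, t):
:     """Which (Lattach, Rattach) combinations are worth recursing on for span [s,t)."""
:     width = t - s
:     if width <= 1:   # a single word: no attachment below
:         return [(False, False)]
:     if width <= 2:   # two words: at most one attachment below
:         return [(False, False), (True, False), (False, True)]
:     return [(False, False), (True, False), (False, True), (True, True)]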
** TODO clean up the module files :PRETTIER:
Is there a better way to divide dmv and harmonic? There's a two-way
dependency between the modules. Guess there could be a third file that
imports both the initialization and the actual EM stuff, while a file
containing constants and classes could be imported by all the others:
: dmv.py imports dmv_EM.py imports dmv_classes.py
: dmv.py imports dmv_inits.py imports dmv_classes.py
** TOGROK Some (tagged) sentences are bound to come twice :OPTIMIZE:
E.g., first sort and count, so that the corpus
: [['nn','vbd','det','nn'],
:  ['vbd','nn','det','nn'],
:  ['nn','vbd','det','nn']]
becomes
: [(['nn','vbd','det','nn'],2),
:  (['vbd','nn','det','nn'],1)]
and then, in each loop through the sentences, handle each distinct
sentence only once, scaling its counts by the number of occurrences.

Is there much to gain here?
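A minimal sketch of that sort-and-count step using the standard
library; the tag lists are turned into tuples since lists aren't
hashable:
: from collections import Counter
:
: corpus = [['nn','vbd','det','nn'],
:           ['vbd','nn','det','nn'],
:           ['nn','vbd','det','nn']]
:
: counted = Counter(tuple(sent) for sent in corpus)
:
: for sent, n in counted.items():
:     # run inner_sent / reestimation once per distinct sentence,
:     # scaling the resulting counts by n
:     pass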
** TOGROK tags as numbers or tags as strings? :OPTIMIZE:
Need to clean up the representation.

Stick with tag-strings in initialization, then switch to numbers for
the IO-algorithm, perhaps? We can probably afford more string-matching
in initialization.
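A sketch of one way to do that switch: build an index over the tag
strings after initialization and convert each sentence to a tuple of
integers before the IO-algorithm (the names are illustrative):
: def make_tag_index(tagonly_sents):
:     """Map each tag string to a small integer, and back."""
:     tags = sorted({tag for sent in tagonly_sents for tag in sent})
:     tag2num = {tag: i for i, tag in enumerate(tags)}
:     num2tag = dict(enumerate(tags))
:     return tag2num, num2tag
:
: # convert once, before the repeated inside-outside passes:
: # numbered = [tuple(tag2num[t] for t in sent) for sent in tagonly_sents]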
* Adjacency and combining it with the inside-outside algorithm
Each DMV probability (for a certain PCFG node) has both an adjacent
and a non-adjacent version; inner() and outer() need to pick the
correct one in each case.

In each inner() call, loc_h is the location of the head of this
dependency structure. In each outer() call, it's the head of the /Node/,
the structure we're looking outside of.

We call inner() for each location of a head, and on each terminal,
loc_h must equal =i= (and =loc_h+1= equal =j=). In the recursive attachment
calls, we use the locations (sentence indices) of words to the left or
right of the head in calls to inner(). /loc_h lets us check whether we
need probN or probA/.
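A minimal sketch of that check, assuming fencepost indices where a
terminal head covers =[loc_h, loc_h+1)=; whether this matches the
exact convention in loc_h_dmv.py is an assumption:
: def stop_prob(probA, probN, h, direction, i, j, loc_h):
:     """Pick the adjacent (probA) or non-adjacent (probN) stop probability.
:
:     The head is adjacent in a direction if nothing has been attached on
:     that side yet, i.e. the span still ends at the head's own position.
:     """
:     if direction == 'LEFT':
:         adjacent = (i == loc_h)
:     else:  # 'RIGHT'
:         adjacent = (j == loc_h + 1)
:     table = probA if adjacent else probN
:     return table[h, direction]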
** Possible alternate type of adjacency
K&M's adjacency is just whether or not an argument has been generated
in the current direction yet. One could also make a stronger type of
adjacency, where h and a are not adjacent if b is in between; e.g., with
the sentence "a b h" and the structure ((h->a), (a->b)), h is
K&M-adjacent to a, but not next to a, since b is in between. It's easy
to check this type of adjacency in inner(), but it needs new
reestimation rules.
* COMMENT Emacs
Make those debug statements steal a bit less attention in Emacs:
:(font-lock-add-keywords
: 'python-mode ; not really regexp, a bit slow
: '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
:\\1 .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
:(font-lock-add-keywords
: 'python-mode
: '(("\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
* Python links
- [[file:src/pseudo.py][pseudo.py]]
- http://nltk.org/doc/en/structured-programming.html -- recursive dynamic programming
- http://nltk.org/doc/en/advanced-parsing.html
- http://jaynes.colorado.edu/PythonIdioms.html
* Git
Repository web page: http://repo.or.cz/w/dmvccm.git

Setting up a new project:
: git init
: git add .
: git commit -m "first release"
Later on: (=-a= does =git rm= and =git add= automatically)
: git commit -a -m "some subsequent release"

Then push stuff up to the remote server:
: git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
(=eval `ssh-agent`= and =ssh-add= to avoid having to type in the
passphrase all the time.)

Make a copy of the (remote) master branch:
: git clone git://repo.or.cz/dmvccm.git
Make and name a new branch in this folder:
: git checkout -b mybranch
To save changes in =mybranch=:
: git commit -a
Go back to the master branch (uncommitted changes from =mybranch= are
carried along):
: git checkout master
To stage only parts of a file:
: git add --interactive
http://www-cs-students.stanford.edu/~blynn//gitmagic/