Instructions for using Dome:

The main area shows the XML/HTML document you are editing on the right
and the code to manipulate it on the left. Clicking the right mouse button
over the main area brings up a popup menu. You can use this menu, or the
keyboard shortcuts, to edit the document.

To move to a node, just click on it. Hold down Ctrl to select multiple
nodes. Double-click a node to collapse or expand it. Shift-click selects
all nodes between the node clicked and the previously selected node.

Most of the menu items are fairly obvious. Here are a few notes though:

- Blank document deletes everything.

- Insert adds a new node just before the current one. Append adds the new
  node just after. Open puts the new node inside.

- 'Shallow' yank/delete only work on the selected nodes themselves, whereas
  the normal yank/delete affect the child nodes too.

- Operations on the root node often can't be undone.

- See www.w3.org for details of the XPath syntax.

- Entering a node makes that node appear to be the root node. You can then
  edit that subtree without worrying about the rest of the document. It's
  faster too, because redrawing a large document is slow.
  When you're done, use Leave to get back out. Undo will undo all changes
  made between the Enter and the Leave in one go.

- Select by XPath is really useful for processing lists. Here are some
  examples:

  - '*' selects all child elements of the current node.
  - 'li' selects all 'li' child nodes.
  - '//a' selects all anchor nodes anywhere in the document.
  - './/a' selects all anchor nodes anywhere inside this node.

  More advanced stuff is also possible, eg:

  - '//text()[ext:match('fred')][2]' selects the second text node containing
    'fred' within each element.

- HTTP GET replaces an anchor node with the HTML document it points to.
  Select the attribute with the URI before using this.
  HTTP POST is similar, but tries to send all non-namespaced attributes
  as POST data, except those with names starting 'header-', which are used
  as extra HTTP headers.

- Don't worry about SOAP messages.

- Substitute works on text nodes. Standard regexp format, eg:

    Replace: .*(\d\d\d\d).*

  which, with the captured group as the replacement, will turn
  'Born in 1957, England' into '1957'.

- Python expression lets you do more complex stuff, eg:

  - 'x.split()' splits the text into words.
  - 'int(x) + 1' increments a number.
  - 'x[:-1]' removes the last character.

- The Program menu is described below.

- Don't worry about the Show as ... items.

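
The XPath, Substitute and Python expression notes above can be tried outside
Dome in plain Python. This is only a sketch: the sample document is made up,
the standard library's ElementTree supports just a subset of XPath (so the
relative './/a' form is used rather than '//a'), and the '\1' replacement is
an assumption about what Substitute is given as the replacement text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this illustration.
doc = ET.fromstring(
    "<body><ul>"
    "<li><a href='a001.html'>One</a></li>"
    "<li><a href='a002.html'>Two</a></li>"
    "</ul><p>Born in 1957, England</p></body>"
)
ul = doc.find("ul")

# XPath-style selections, each relative to a "current node".
children = ul.findall("*")       # '*': all child elements of 'ul'
items = ul.findall("li")         # 'li': all 'li' children of 'ul'
anchors = doc.findall(".//a")    # './/a': all anchors anywhere inside 'doc'

# Substitute: the pattern matches the whole text, and the replacement keeps
# only the four captured digits. r"\1" is an assumed replacement string.
text = doc.find("p").text
year = re.sub(r".*(\d\d\d\d).*", r"\1", text)

# Python expressions, with x standing for the text of the selected node.
x = "42"
incremented = int(x) + 1                 # 'int(x) + 1'
words = "Born in 1957".split()           # 'x.split()'
trimmed = "1957,"[:-1]                   # 'x[:-1]'

print(len(items), [a.get("href") for a in anchors], year)
```
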
Once you can edit the document OK, you can start recording operations.
The green area on the left shows the currently selected program ('Root' to
start with). To record a sequence of operations:

- Click on the black line below the 'Start' node (a yellow dot appears).

- Click on the Record button on the toolbar. The dot will turn red.

- Perform the operations on the document. Every operation you do will be
  recorded.

- Click Record again to stop recording.

To play back the recorded sequence, right-click on 'Root' (just above the
green area) and select Play. The yellow dot moves through the chain as each
operation is performed. The four buttons after Record can be used to stop,
step and resume playing.

You can also use this menu to create new programs. If you want to run a
program once for each selected node, use Map instead of Play.

The Program menu has these items, which are only really useful when
recording:

- Input pops up a dialog box. Whatever the user enters is placed on the

- Compare succeeds if all selected nodes have the same value and structure.

Processing a web site:

- Start a new blank document.

- Begin recording the root program.

- Add an attribute 'uri'.

- Edit the value to the URI of the index page. Eg:

    http://www.ibiblio.org/wm/paint/auth/

- Network -> HTTP suck.

- Select the 'ul' element, which contains everything we need.

- Yank it, then Paste Replace the body and delete the unneeded 'head'.

- Select the nodes you want. Eg:

  - Click on the 'ul' element with the list of names.
  - Use Select -> By XPath to select all the 'li' nodes.

- Create a new program (right-click on 'Root'). Call it 'Artist' or
  something similar. This program will turn a 'li' into an 'Artist'.

- Right-click on Artist and choose Map. This will run the Artist program
  once for each selected 'li' node.

- Since the Artist program is empty, execution stops immediately, inside the
  map operation. Above Root the message '1 frame' is displayed. This
  indicates that when the Artist sequence finishes, there is a suspended
  operation to return to.

- Click on Record to start recording the Artist program.

- Choose Move -> Enter so we can concentrate on just this element.

- Rename 'li' to 'Artist' and add 'Name' and 'Years' elements.

- Select the 'href' attribute and suck.

- Select the name and yank it. Put it in the Name element.
  Tip: Not all pages have the name in the 'strong' element, but they all
  have the name in the heading.

- Do a text search for '.*\(.*\d\d\d\d/.*\)'. This selects the first text
  node containing a four-digit number in brackets. Without the escapes,
  it looks a bit clearer: '*(*DDDD*)'. Note the leading *.

- Yank this and put it in the Years element.
  Tip: use Home before clicking on Years so that this will work wherever
  the text was found. If you click directly on Years then it will be
  recorded as 'Up three parent nodes, then back one to Years'.

- Delete the html node.

- Stop recording by clicking on Record a second time. Click on Play to
  continue with the Map.

Dome will now process the rest of the site, until it hits a new error.

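
As a rough picture of what the recorded Artist program does to each 'li',
here is the same transformation written by hand in Python. It is a sketch
under several assumptions: the function name li_to_artist, the sample
heading and page text are all made up, nothing is fetched over HTTP, and
(unlike the text search in the steps above, which selects the whole text
node) the regexp here keeps only the bracketed part.

```python
import re
import xml.etree.ElementTree as ET


def li_to_artist(li, heading, body_text):
    """Turn an 'li' element into an 'Artist' with Name and Years children.

    heading and body_text stand in for text taken from the fetched page.
    """
    li.tag = "Artist"                 # Rename 'li' to 'Artist'
    for child in list(li):            # Drop the old content of the node
        li.remove(child)
    li.text = None

    name = ET.SubElement(li, "Name")
    name.text = heading               # The name comes from the page heading

    years = ET.SubElement(li, "Years")
    # First bracketed run of text that contains a four-digit number.
    match = re.search(r"\([^()]*\d\d\d\d[^()]*\)", body_text)
    years.text = match.group(0) if match else ""
    return li


# "Map": run the same program once for each selected 'li' node.
ul = ET.fromstring(
    "<ul><li><a href='a001.html'>x</a></li>"
    "<li><a href='a002.html'>y</a></li></ul>"
)
for node in ul.findall("li"):
    li_to_artist(node, "Some Artist", "Painter (1900-1950), worked in oils")

print(ET.tostring(ul, encoding="unicode"))
```
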
Running in a cron job:

The GUI is quite slow. Once your program is working you can use the nogui.py
program to run it without the frontend. The syntax is:

  $ Dome/nogui.py project.dome

This runs the root program in project.dome. When it finishes, the result
is saved back over project.dome and the data is exported as project.xml.

Tip: get your cron job to make backups too!

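
One way to follow the backup tip is a small wrapper that copies the project
file before running it. This is a sketch, not part of Dome: the function
names, the 'backups' directory and the timestamp format are assumptions.
Only the 'Dome/nogui.py project.dome' command line comes from the text above.

```python
import shutil
import subprocess
import time
from pathlib import Path


def backup_project(project, backup_dir="backups"):
    """Copy the project file to a timestamped backup and return the new path."""
    project = Path(project)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{project.stem}-{stamp}{project.suffix}"
    shutil.copy2(project, target)
    return target


def run_headless(project, nogui="Dome/nogui.py", backup_dir="backups"):
    """Back up the project, then run its root program without the GUI."""
    backup_project(project, backup_dir)
    # Same command line as shown above: Dome/nogui.py project.dome
    return subprocess.run([nogui, str(project)], check=True)
```

The cron job would then call run_headless("project.dome") from a short
script instead of invoking nogui.py directly, so a bad run never overwrites
the only copy of the project.
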
Rechecking a whole site is slow. The solution is to use the output from the
first scan in future scans. Each page you sucked should have gained an
'md5_sum' attribute (and maybe a 'last-modified' too).

Eg, you should have output a bit like this:

  <Artists uri='index.html' md5_sum='...' last-modified='...'>
    <Artist uri='a001.html' md5_sum='...' last-modified='...'>...
    <Artist uri='a002.html' md5_sum='...' last-modified='...'>...

If you now do a suck on any of these nodes, Dome will check whether the page's
contents have changed. If so, it pulls it in as normal. Otherwise, nothing
happens.

The 'modified' attribute will be added or removed to indicate which happened.
So, you can get your program to first try sucking the Artists node and then
suck each of the 'Artist' ones. If nothing has changed, it will all go very
quickly. To force a re-suck, delete the md5_sum and last-modified attributes.

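
The checksum comparison described above can be pictured with a few lines of
Python. This is an illustration of the idea, not Dome's actual code: the
function name check_page, the use of a plain dict for a node's attributes,
and the value stored in 'modified' are all assumptions, and the
'last-modified' attribute is not modelled.

```python
import hashlib


def check_page(attrs, content):
    """Record a freshly fetched page in attrs; return True if it changed.

    attrs is a plain dict standing in for a node's attributes, and
    content is the page body as bytes.
    """
    new_sum = hashlib.md5(content).hexdigest()
    if attrs.get("md5_sum") == new_sum:
        attrs.pop("modified", None)   # Unchanged: remove the marker
        return False
    attrs["md5_sum"] = new_sum
    attrs["modified"] = "yes"         # Assumed value; only the name is known
    return True


node = {"uri": "a001.html"}
print(check_page(node, b"first version"))   # first fetch counts as a change
print(check_page(node, b"first version"))   # same content, nothing to do
```

Deleting the 'md5_sum' entry from the dict makes the next check report a
change again, which mirrors the way a re-suck is forced.
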
Problems? Comments? Email <tal00r@ecs.soton.ac.uk>