Dome/Help/README

Instructions for using Dome:

The main area shows the XML/HTML document you are editing on the right
and the code to manipulate it on the left. Clicking the right mouse button
over the main area brings up a popup menu. You can use this menu, or the
keyboard shortcuts, to edit the document.

To move to a node, just click on it. Hold down Ctrl to select multiple
nodes. Double-click a node to collapse/expand it. Shift-click selects
all nodes between the node clicked and the previously selected node.

Most of the menu items are fairly obvious. Here are a few notes though:

- Blank document deletes everything.

- Insert adds a new node just before the current one. Append adds the new
  node just after. Open puts the new node inside.

- 'Shallow' yank/delete only work on the selected nodes themselves, whereas
  the normal yank/delete affect the child nodes too.

- Operations on the root node often can't be undone.

- See www.w3.org for details of the XPath syntax.

- Entering a node makes that node appear to be the root node. You can then
  edit that subtree without worrying about the rest of the document. It's
  faster too, because redrawing a large document is really slow.
  When you're done, use Leave to get back out. Undo will undo all changes
  made between the Enter and the Leave in one go.

- Select by XPath is really useful for processing lists. Here are some
  useful paths to use:

  - '*' selects all child elements of the current node.
  - 'li' selects all 'li' child nodes.
  - '//a' selects all anchor nodes anywhere in the document.
  - './/a' selects all anchor nodes anywhere inside this node.

  More advanced stuff is also possible, eg:

  - '//text()[ext:match('fred')][2]' selects the second text node containing
    'fred' within each element.

- HTTP GET replaces an anchor node with the HTML document it points to.
  Select the attribute with the URI before using this.
  HTTP POST is similar, but tries to send all non-namespaced attributes
  as POST data, except those with names starting 'header-', which are used
  as extra HTTP headers.

- Don't worry about SOAP messages.

- Substitute works on text nodes. Standard regexp format, eg:
        Replace: .*(\d\d\d\d).*
           With: \1
  will turn 'Born in 1957, England' into '1957'.

- Python expression lets you do more complex stuff, eg:
  - 'x.split()' splits the text into words.
  - 'int(x) + 1' increments a number.
  - 'x[:-1]' removes the last character.

- The Program menu is described below.

- Don't worry about the Show as ... items.
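
As a rough illustration of the Select by XPath, Substitute and Python
expression notes above, the same operations can be tried in plain Python
outside Dome. This is only a sketch: it uses the standard library's
xml.etree.ElementTree, which understands a small XPath subset (the '//a'
and ext:match forms are left out because that subset does not cover them),
and it assumes that 'x' in a Python expression is the text of the current
node. The sample document is made up.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample document, used only for illustration.
doc = ET.fromstring(
    "<ul>"
    "<li><a href='a001.html'>One</a></li>"
    "<li><a href='a002.html'>Two</a></li>"
    "<p>Born in 1957, England</p>"
    "</ul>"
)

# Select by XPath, relative to the current node (here the 'ul' element).
all_children = doc.findall("*")    # '*'    : all child elements
list_items = doc.findall("li")     # 'li'   : all 'li' child nodes
anchors = doc.findall(".//a")      # './/a' : anchors anywhere inside this node

# Substitute: Replace '.*(\d\d\d\d).*' With '\1'.
x = doc.find("p").text
year = re.sub(r".*(\d\d\d\d).*", r"\1", x)

# Python expression: assuming 'x' holds the text of the current node.
words = x.split()                  # split the text into words
next_year = int(year) + 1          # increment a number
trimmed = x[:-1]                   # remove the last character

print(len(all_children), len(list_items), len(anchors))
print(year, next_year, words, trimmed)
```

Trying a pattern or expression like this on a sample string first is a
cheap way to check it before recording it into a program.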



Once you can edit the document OK, you can start recording operations.
The green area on the left shows the currently selected program ('Root' to
begin with).

To record a sequence:

- Click on the black line below the 'Start' node (a yellow dot appears).

- Click on the Record button on the toolbar. The dot will turn red.

- Perform the operations on the document. Every operation you do will be
  added to the chain.

- Click Record again to stop recording.


To play back the recorded sequence, right-click on 'Root' (just above the
green area) and select Play. The yellow dot moves through the chain as each
operation is performed. The four buttons after Record can be used to stop,
step and resume playing.

You can also use this menu to create new programs. If you want to run a
program once for each selected node, use Map instead of Play.


The Program menu has these items, which are only really useful when
recording programs:

- Input pops up a dialog box. Whatever the user enters is placed on the
  clipboard.

- Compare succeeds if all selected nodes have the same value and structure.

- Fail always fails.
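
One way to picture what Compare checks is a recursive equality test over
the selected nodes. The sketch below is an assumption about what 'same
value and structure' means, not Dome's actual code: it works on
xml.etree.ElementTree elements and treats two nodes as equal when their
tag, attributes, text and children all match. The function names are
hypothetical.

```python
import xml.etree.ElementTree as ET


def same_node(a, b):
    """Return True if two elements have equal tag, attributes, text and children."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    # Ignore surrounding whitespace when comparing text (an assumption).
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if len(a) != len(b):
        return False
    return all(same_node(x, y) for x, y in zip(a, b))


def compare(selected):
    """Succeed (return True) if every selected node matches the first one."""
    if not selected:
        return True
    return all(same_node(selected[0], node) for node in selected[1:])


root = ET.fromstring(
    "<r><n id='1'><v>x</v></n><n id='1'><v>x</v></n><n id='1'><v>y</v></n></r>"
)
print(compare(root[:2]))   # the first two nodes are identical
print(compare(root[:]))    # the third node has different text
```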


Processing a web site:

- Start a new blank document.

- Begin recording the root program.

- Add an attribute 'uri'.

- Set the value to the URI of the index page. Eg:

        http://www.ibiblio.org/wm/paint/auth/

- Network -> HTTP suck.

- Select the 'ul' element, which contains everything we need.

- Yank it, then use Paste Replace on the body and delete the unneeded 'head'.

- Select the nodes you want. Eg:

  - Click on the 'ul' element with the list of names.
  - Use Select -> By XPath to select all the 'li' nodes.

- Create a new program (right click on 'Root'). Call it 'Artist' or
  somesuch. This program will turn an 'li' into an 'Artist'.

- Right click on Artist and choose Map. This will run the Artist program
  on each 'li' element.

- Since the Artist program is empty, execution stops immediately, inside the
  map operation. Above Root the message '1 frame' is displayed. This
  indicates that when the Artist sequence finishes, there is a suspended
  operation to return to.

- Click on Record to start recording the Artist program.

- Choose Move -> Enter so we can concentrate on just this element.

- Rename 'li' to 'Artist' and add 'Name' and 'Years' elements.

- Select the 'href' attribute and suck.

- Select the name and yank it. Put it in the Name element.
  Tip: Not all pages have the name in the 'strong' element, but they all
  have the name in the heading.

- Do a text search for '.*\(.*\d\d\d\d/.*\)'. This selects the first text
  node containing a four digit number in brackets. Without the escapes,
  it looks a bit clearer: '*(*DDDD*)'. Note the leading *.

- Yank this and put it in the Years element.
  Tip: use Home before clicking on Years so that this will work wherever
  the text was found. If you click directly on Years then it will be
  recorded as 'Up three parent nodes, then back one to Years'.

- Delete the html node.

- Leave.

- Stop recording by clicking on Record a second time. Click on Play to
  continue with the Map.

Dome will now process the rest of the site, until it hits a new error.
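
For readers who think in code, the Artist program amounts to roughly the
following loop. This only illustrates what the recorded steps do, not how
Dome implements Map: the fetch_page function, the canned sample pages and
the simplified search pattern are all stand-ins, and real pages are messier.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for the pages Dome would fetch over HTTP (made-up content).
PAGES = {
    "a001.html": "<html><body><h1>First Artist</h1>"
                 "<p>Painter (born 1475)</p></body></html>",
    "a002.html": "<html><body><h1>Second Artist</h1>"
                 "<p>Sculptor (1840-1917)</p></body></html>",
}

# Simplified version of the text search: a four digit number in brackets.
YEARS = re.compile(r".*\(.*\d\d\d\d.*\)")


def fetch_page(uri):
    """Hypothetical fetch: parse the canned page for this URI."""
    return ET.fromstring(PAGES[uri])


def artist_program(li):
    """Turn one 'li' element into an 'Artist' element with Name and Years."""
    page = fetch_page(li.find("a").get("href"))
    artist = ET.Element("Artist")
    # All pages have the name in the heading.
    ET.SubElement(artist, "Name").text = page.find(".//h1").text
    # First text node containing a four digit number in brackets.
    years = next(t for t in page.itertext() if YEARS.match(t))
    ET.SubElement(artist, "Years").text = years
    return artist


index = ET.fromstring(
    "<ul><li><a href='a001.html'>First</a></li>"
    "<li><a href='a002.html'>Second</a></li></ul>"
)

# Map: run the program once for each selected 'li' node.
artists = [artist_program(li) for li in index.findall("li")]
for artist in artists:
    print(ET.tostring(artist, encoding="unicode"))
```

In this sketch a page with no bracketed year raises StopIteration, which
plays the part of the 'new error' that stops Dome part way through a Map.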


Running in a cron job:

The GUI is quite slow. Once your program is working you can use the nogui.py
program to run it without the frontend. The syntax is:

        $ Dome/nogui.py project.dome

This runs the root program in project.dome. When it finishes, the result
is saved back over project.dome and the data is exported as project.xml.

Tip: get your cron job to make backups too!
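
Since nogui.py saves its result back over the project file, a small wrapper
can take the backup before each run. The following is a sketch only: the
backup naming scheme and the backup and run_project helpers are assumptions,
and the command it runs is simply the one shown above.

```python
import shutil
import subprocess
import sys
import time
from pathlib import Path


def backup(project, stamp=None):
    """Copy the project file to a timestamped sibling and return the new path."""
    project = Path(project)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    target = project.with_name(f"{project.name}.{stamp}.bak")
    shutil.copy2(project, target)
    return target


def run_project(project, nogui="Dome/nogui.py"):
    """Back up the project, then run its root program without the GUI."""
    backup(project)
    # Equivalent to: $ Dome/nogui.py project.dome
    return subprocess.run([nogui, str(project)], check=True)


if __name__ == "__main__":
    # Expects exactly one argument: the path to the .dome project file.
    if len(sys.argv) == 2:
        run_project(sys.argv[1])
```

Saved under a name of your choosing, the cron entry would call this wrapper
with project.dome as its argument instead of calling nogui.py directly.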


Replaying:

Rechecking a whole site is slow. The solution is to use the output from the
first scan in future scans. Each page you sucked should have gained an
'md5_sum' attribute (and maybe a 'last-modified' too).

Eg, you should have output a bit like this:

<Artists uri='index.html' md5_sum='...' last-modified='...'>
        <Artist uri='a001.html' md5_sum='...' last-modified='...'>...
        <Artist uri='a002.html' md5_sum='...' last-modified='...'>...
        ...
</Artists>

If you now do a suck on any of these nodes, Dome will check whether the page's
contents have changed. If so, it pulls it in as normal. Otherwise, nothing
happens.

The 'modified' attribute will be added or removed to indicate which happened.
So, you can get your program to first try sucking the Artists node and then
suck each of the 'Artist' ones. If nothing has changed, it will all go very
quickly. To force a re-suck, delete the md5_sum and last-modified attributes.
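
The check behind a repeat suck can be pictured as follows. This sketch is a
guess at the logic from the description above, not Dome's source: it assumes
md5_sum holds the hex MD5 digest of the page body, uses 'yes' as the value
of the modified attribute, ignores last-modified, and takes the page content
as an argument instead of fetching it. The resuck function name is
hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def resuck(node, content):
    """Update 'node' for freshly fetched 'content' (bytes).

    Returns True if the page changed and should be processed as normal,
    False if it matches the stored md5_sum and can be skipped.
    """
    digest = hashlib.md5(content).hexdigest()
    if node.get("md5_sum") == digest:
        # Unchanged: nothing happens, and 'modified' is removed.
        node.attrib.pop("modified", None)
        return False
    # New or changed page: remember the checksum and flag the node.
    node.set("md5_sum", digest)
    node.set("modified", "yes")
    return True


artist = ET.Element("Artist", uri="a001.html")
print(resuck(artist, b"<html>v1</html>"))   # first scan
print(resuck(artist, b"<html>v1</html>"))   # same content again
del artist.attrib["md5_sum"]                # force a re-suck
print(resuck(artist, b"<html>v1</html>"))
```

A program can then test for the modified attribute to decide whether the
rest of its steps need to run for that node.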


Problems? Comments? Email <tal00r@ecs.soton.ac.uk>