! Copyright (C) 2008 Doug Coleman.
! See http://factorcode.org/license.txt for BSD license.
USING: help.markup help.syntax io.streams.string urls
multiline spider.private quotations ;

HELP: <spider>
{ $values
    { "base" "a string or url" } }
{ $description "Creates a new web spider with a given base URL." } ;

HELP: run-spider
{ $description "Runs a spider until completion. See the " { $link "spider-tutorial" } " for a complete description of the tuple slots that affect how the spider works." } ;

HELP: slurp-heap-while
{ $values
    { "heap" "a heap" } { "quot1" quotation } { "quot2" quotation } }
{ $description "Removes values from a heap that match the predicate quotation " { $snippet "quot1" } " and processes them with " { $snippet "quot2" } " until the predicate quotation no longer matches." } ;

ARTICLE: "spider-tutorial" "Spider tutorial"
"To create a new spider, call the " { $link <spider> } " word with a link to the site you wish to spider."
{ $code <" "http://concatenative.org" <spider> "> }
"The " { $slot "max-depth" } " slot is initialized to 0, which retrieves just the initial page. Let's set it to something more fun:"
{ $code <" 1 >>max-depth "> }
"Now the spider will retrieve the first page and all the pages it links to in the same domain." $nl
"But suppose the front page contains thousands of links. To avoid grabbing them all, we can set " { $slot "max-count" } " to a reasonable limit."
{ $code <" 10 >>max-count "> }
"A delay between requests, stored in the " { $slot "sleep" } " slot, keeps the spider from hitting the server too hard:"
{ $code <" USE: calendar 1.5 seconds >>sleep "> }
"Since we happen to know that not all pages of a wiki are suitable for spidering, we will spider only the wiki view pages, not the edit or revisions pages. To do this, we add a filter through which new links are tested; links that pass the filter are added to the todo queue, while links that do not are discarded. You can add several filters to the " { $slot "filters" } " array, but we'll just add a single one for now."
{ $code <" { [ path>> "/wiki/view" head? ] } >>filters "> }
"Finally, to start the spider, call the " { $link run-spider } " word."
{ $code "run-spider" }
"The full code from the tutorial:"
{ $code <" USING: spider calendar sequences accessors ;
: spider-concatenative ( -- spider )
    "http://concatenative.org" <spider>
    1 >>max-depth
    10 >>max-count
    1.5 seconds >>sleep
    { [ path>> "/wiki/view" head? ] } >>filters
    run-spider ;"> } ;

ARTICLE: "spider" "Spider"
"The " { $vocab-link "spider" } " vocabulary implements a simple web spider for retrieving sets of webpages."
{ $subsection "spider-tutorial" }
"Creating a new spider:"
{ $subsection <spider> }
"Running the spider:"
{ $subsection run-spider } ;