\section{\module{robotparser} ---
         Parser for robots.txt}
\declaremodule{standard}{robotparser}
\modulesynopsis{Loads a \protect\file{robots.txt} file and
                answers questions about fetchability of other URLs.}
\sectionauthor{Skip Montanaro}{skip@mojam.com}

\index{World Wide Web}
This module provides a single class, \class{RobotFileParser}, which answers
questions about whether or not a particular user agent can fetch a URL on
the Web site that published the \file{robots.txt} file.  For more details on
the structure of \file{robots.txt} files, see
\url{http://info.webcrawler.com/mak/projects/robots/norobots.html}.
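
As an illustration, a minimal (hypothetical) \file{robots.txt} file of the
kind this module parses might allow every agent to fetch anything except
URLs below \file{/cgi-bin/}:

\begin{verbatim}
User-agent: *
Disallow: /cgi-bin/
\end{verbatim}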
\begin{classdesc}{RobotFileParser}{}

This class provides a set of methods to read, parse and answer questions
about a single \file{robots.txt} file.
\begin{methoddesc}{set_url}{url}
Sets the URL referring to a \file{robots.txt} file.
\end{methoddesc}
\begin{methoddesc}{read}{}
Reads the \file{robots.txt} URL and feeds it to the parser.
\end{methoddesc}
\begin{methoddesc}{parse}{lines}
Parses the \var{lines} argument, a list of lines from a
\file{robots.txt} file.
\end{methoddesc}
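
A minimal sketch of using \method{parse()} with lines fetched by other
means, here with \refmodule{urllib} (the URL is only illustrative):

\begin{verbatim}
>>> import urllib, robotparser
>>> rp = robotparser.RobotFileParser()
>>> lines = urllib.urlopen("http://www.musi-cal.com/robots.txt").readlines()
>>> rp.parse(lines)
\end{verbatim}

The parser can then answer \method{can_fetch()} queries exactly as in the
example at the end of this section.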
\begin{methoddesc}{can_fetch}{useragent, url}
Returns \code{True} if the \var{useragent} is allowed to fetch the \var{url}
according to the rules contained in the parsed \file{robots.txt} file.
\end{methoddesc}
\begin{methoddesc}{mtime}{}
Returns the time the \code{robots.txt} file was last fetched.  This is
useful for long-running web spiders that need to check for new
\code{robots.txt} files periodically.
\end{methoddesc}
\begin{methoddesc}{modified}{}
Sets the time the \code{robots.txt} file was last fetched to the current
time.
\end{methoddesc}
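
A minimal sketch of how a long-running spider might combine
\method{mtime()} and \method{modified()} to refresh its cached rules (the
one-hour threshold is an arbitrary, illustrative choice):

\begin{verbatim}
import time
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")
rp.read()
rp.modified()          # record when the rules were fetched

# Later, in the spider's main loop:
if time.time() - rp.mtime() > 3600:   # older than an hour; refresh
    rp.read()
    rp.modified()
\end{verbatim}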
\end{classdesc}

The following example demonstrates basic use of the \class{RobotFileParser}
class.
\begin{verbatim}
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
\end{verbatim}