 <span id="urllib-robotparser-parser-for-robots-txt"></span><h1>urllib.robotparser — Parser for robots.txt</h1> <p><strong>Source code:</strong> <a class="reference external" href="https://github.com/python/cpython/tree/3.12/Lib/urllib/robotparser.py">Lib/urllib/robotparser.py</a></p>  <p>This module provides a single class, <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code>RobotFileParser</code></a>, which answers questions about whether or not a particular user agent can fetch a URL on the web site that published the <code>robots.txt</code> file. For more details on the structure of <code>robots.txt</code> files, see <a class="reference external" href="http://www.robotstxt.org/orig.html">http://www.robotstxt.org/orig.html</a>.</p> <dl class="py class"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser">
<code>class urllib.robotparser.RobotFileParser(url='')</code> </dt> <dd>
<p>This class provides methods to read, parse and answer questions about the <code>robots.txt</code> file at <em>url</em>.</p> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.set_url">
<code>set_url(url)</code> </dt> <dd>
<p>Sets the URL referring to a <code>robots.txt</code> file.</p> </dd>
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.read">
<code>read()</code> </dt> <dd>
<p>Reads the <code>robots.txt</code> URL and feeds it to the parser.</p> </dd>
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.parse">
<code>parse(lines)</code> </dt> <dd>
<p>Parses the <em>lines</em> argument, which should be a list of lines from a <code>robots.txt</code> file.</p> </dd>
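<p>For example, <code>parse()</code> makes it possible to reuse rules obtained by some other means, such as a cached copy, instead of fetching them with <code>read()</code>. A minimal sketch; the rules shown are placeholders for illustration:</p> <pre data-language="python">import urllib.robotparser

# Placeholder robots.txt content, e.g. loaded from a local cache.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())      # parse() accepts a list of lines
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
</pre>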
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.can_fetch">
<code>can_fetch(useragent, url)</code> </dt> <dd>
<p>Returns <code>True</code> if the <em>useragent</em> is allowed to fetch the <em>url</em> according to the rules contained in the parsed <code>robots.txt</code> file.</p> </dd>
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.mtime">
<code>mtime()</code> </dt> <dd>
<p>Returns the time the <code>robots.txt</code> file was last fetched. This is useful for long-running web spiders that need to check for new <code>robots.txt</code> files periodically.</p> </dd>
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.modified">
<code>modified()</code> </dt> <dd>
<p>Sets the time the <code>robots.txt</code> file was last fetched to the current time.</p> </dd>
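<p>Together, <code>mtime()</code> and <code>modified()</code> support the periodic re-checking mentioned under <code>mtime()</code>: a long-running spider can re-fetch <code>robots.txt</code> once its copy grows older than some threshold. A sketch; the one-hour threshold and the URL are arbitrary choices for illustration:</p> <pre data-language="python">import time
import urllib.robotparser

MAX_AGE = 3600  # seconds; an arbitrary refresh interval for this example

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()
rp.modified()   # record the fetch time explicitly

def allowed(useragent, url):
    # Re-fetch the rules once the cached copy is older than MAX_AGE.
    if time.time() - rp.mtime() &gt; MAX_AGE:
        rp.read()
        rp.modified()
    return rp.can_fetch(useragent, url)
</pre>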
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.crawl_delay">
<code>crawl_delay(useragent)</code> </dt> <dd>
<p>Returns the value of the <code>Crawl-delay</code> parameter from <code>robots.txt</code> for the <em>useragent</em> in question. If there is no such parameter, if it does not apply to the specified <em>useragent</em>, or if the <code>robots.txt</code> entry for this parameter has invalid syntax, <code>None</code> is returned.</p> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.6.</span></p> </div> </dd>
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.request_rate">
<code>request_rate(useragent)</code> </dt> <dd>
<p>Returns the contents of the <code>Request-rate</code> parameter from <code>robots.txt</code> as a <a class="reference internal" href="../glossary#term-named-tuple"><span class="xref std std-term">named tuple</span></a> <code>RequestRate(requests, seconds)</code>. If there is no such parameter, if it does not apply to the specified <em>useragent</em>, or if the <code>robots.txt</code> entry for this parameter has invalid syntax, <code>None</code> is returned.</p> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.6.</span></p> </div> </dd>
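<p>One way to honour both parameters is to translate them into a single pause between successive requests, taking the more conservative of the two. A sketch, assuming <code>None</code> simply means no constraint was given; the URL is a placeholder:</p> <pre data-language="python">import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

def pause_for(useragent):
    # Either value may be None when robots.txt does not specify it.
    delay = rp.crawl_delay(useragent) or 0
    rate = rp.request_rate(useragent)
    if rate is not None:
        # RequestRate(requests, seconds): spread the requests evenly.
        delay = max(delay, rate.seconds / rate.requests)
    return delay
</pre>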
</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.site_maps">
<code>site_maps()</code> </dt> <dd>
<p>Returns the contents of the <code>Sitemap</code> parameter from <code>robots.txt</code> in the form of a <a class="reference internal" href="stdtypes#list" title="list"><code>list()</code></a>. If there is no such parameter or if the <code>robots.txt</code> entry for this parameter has invalid syntax, <code>None</code> is returned.</p> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.8.</span></p> </div> </dd>
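<p>For example, a crawler might seed its queue from the advertised sitemaps; the URL below is a placeholder:</p> <pre data-language="python">import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

sitemaps = rp.site_maps()   # a list of sitemap URLs, or None
if sitemaps is not None:
    for sitemap_url in sitemaps:
        print(sitemap_url)
</pre>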
</dl> </dd>
</dl> <p>The following example demonstrates basic use of the <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code>RobotFileParser</code></a> class:</p> <pre data-language="python">&gt;&gt;&gt; import urllib.robotparser
&gt;&gt;&gt; rp = urllib.robotparser.RobotFileParser()
&gt;&gt;&gt; rp.set_url("http://www.musi-cal.com/robots.txt")
&gt;&gt;&gt; rp.read()
&gt;&gt;&gt; rrate = rp.request_rate("*")
&gt;&gt;&gt; rrate.requests
3
&gt;&gt;&gt; rrate.seconds
20
&gt;&gt;&gt; rp.crawl_delay("*")
6
&gt;&gt;&gt; rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
&gt;&gt;&gt; rp.can_fetch("*", "http://www.musi-cal.com/")
True
</pre>
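<p>The same pieces combine naturally into a polite fetch helper. The following sketch is illustrative only; the host and the user agent string are placeholders:</p> <pre data-language="python">import time
import urllib.request
import urllib.robotparser

AGENT = "ExampleBot"  # placeholder user agent name

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

def polite_fetch(url):
    # Respect the Disallow rules for this agent.
    if not rp.can_fetch(AGENT, url):
        return None
    # Honour Crawl-delay when one is given; otherwise do not pause.
    time.sleep(rp.crawl_delay(AGENT) or 0)
    with urllib.request.urlopen(url) as response:
        return response.read()
</pre> <div class="_attribution">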
  <p class="_attribution-p">
    &copy; 2001&ndash;2023 Python Software Foundation<br>Licensed under the PSF License.<br>
    <a href="https://docs.python.org/3.12/library/urllib.robotparser.html" class="_attribution-link">https://docs.python.org/3.12/library/urllib.robotparser.html</a>
  </p>
</div>