| author | Craig Jennings <c@cjennings.net> | 2024-04-07 13:41:34 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2024-04-07 13:41:34 -0500 |
| commit | 754bbf7a25a8dda49b5d08ef0d0443bbf5af0e36 (patch) | |
| tree | f1190704f78f04a2b0b4c977d20fe96a828377f1 /devdocs/python~3.12/library%2Furllib.robotparser.html | |
new repository
Diffstat (limited to 'devdocs/python~3.12/library%2Furllib.robotparser.html')
| -rw-r--r-- | devdocs/python~3.12/library%2Furllib.robotparser.html | 51 |
1 files changed, 51 insertions, 0 deletions
diff --git a/devdocs/python~3.12/library%2Furllib.robotparser.html b/devdocs/python~3.12/library%2Furllib.robotparser.html
new file mode 100644
index 00000000..116f14f1
--- /dev/null
+++ b/devdocs/python~3.12/library%2Furllib.robotparser.html
@@ -0,0 +1,51 @@
+ <span id="urllib-robotparser-parser-for-robots-txt"></span><h1>urllib.robotparser — Parser for robots.txt</h1> <p><strong>Source code:</strong> <a class="reference external" href="https://github.com/python/cpython/tree/3.12/Lib/urllib/robotparser.py">Lib/urllib/robotparser.py</a></p> <p>This module provides a single class, <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code>RobotFileParser</code></a>, which answers questions about whether or not a particular user agent can fetch a URL on the web site that published the <code>robots.txt</code> file. For more details on the structure of <code>robots.txt</code> files, see <a class="reference external" href="http://www.robotstxt.org/orig.html">http://www.robotstxt.org/orig.html</a>.</p> <dl class="py class"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser">
+<code>class urllib.robotparser.RobotFileParser(url='')</code> </dt> <dd>
+<p>This class provides methods to read, parse and answer questions about the <code>robots.txt</code> file at <em>url</em>.</p> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.set_url">
+<code>set_url(url)</code> </dt> <dd>
+<p>Sets the URL referring to a <code>robots.txt</code> file.</p> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.read">
+<code>read()</code> </dt> <dd>
+<p>Reads the <code>robots.txt</code> URL and feeds it to the parser.</p> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.parse">
+<code>parse(lines)</code> </dt> <dd>
+<p>Parses the lines argument.</p> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.can_fetch">
+<code>can_fetch(useragent, url)</code> </dt> <dd>
+<p>Returns <code>True</code> if the <em>useragent</em> is allowed to fetch the <em>url</em> according to the rules contained in the parsed <code>robots.txt</code> file.</p> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.mtime">
+<code>mtime()</code> </dt> <dd>
+<p>Returns the time the <code>robots.txt</code> file was last fetched. This is useful for long-running web spiders that need to check for new <code>robots.txt</code> files periodically.</p> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.modified">
+<code>modified()</code> </dt> <dd>
+<p>Sets the time the <code>robots.txt</code> file was last fetched to the current time.</p> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.crawl_delay">
+<code>crawl_delay(useragent)</code> </dt> <dd>
+<p>Returns the value of the <code>Crawl-delay</code> parameter from <code>robots.txt</code> for the <em>useragent</em> in question. If there is no such parameter or it doesn’t apply to the <em>useragent</em> specified or the <code>robots.txt</code> entry for this parameter has invalid syntax, return <code>None</code>.</p> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.6.</span></p> </div> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.request_rate">
+<code>request_rate(useragent)</code> </dt> <dd>
+<p>Returns the contents of the <code>Request-rate</code> parameter from <code>robots.txt</code> as a <a class="reference internal" href="../glossary#term-named-tuple"><span class="xref std std-term">named tuple</span></a> <code>RequestRate(requests, seconds)</code>. If there is no such parameter or it doesn’t apply to the <em>useragent</em> specified or the <code>robots.txt</code> entry for this parameter has invalid syntax, return <code>None</code>.</p> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.6.</span></p> </div> </dd>
+</dl> <dl class="py method"> <dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.site_maps">
+<code>site_maps()</code> </dt> <dd>
+<p>Returns the contents of the <code>Sitemap</code> parameter from <code>robots.txt</code> in the form of a <a class="reference internal" href="stdtypes#list" title="list"><code>list()</code></a>. If there is no such parameter or the <code>robots.txt</code> entry for this parameter has invalid syntax, return <code>None</code>.</p> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.8.</span></p> </div> </dd>
+</dl> </dd>
+</dl> <p>The following example demonstrates basic use of the <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code>RobotFileParser</code></a> class:</p> <pre data-language="python">>>> import urllib.robotparser
+>>> rp = urllib.robotparser.RobotFileParser()
+>>> rp.set_url("http://www.musi-cal.com/robots.txt")
+>>> rp.read()
+>>> rrate = rp.request_rate("*")
+>>> rrate.requests
+3
+>>> rrate.seconds
+20
+>>> rp.crawl_delay("*")
+6
+>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
+False
+>>> rp.can_fetch("*", "http://www.musi-cal.com/")
+True
+</pre> <div class="_attribution">
+ <p class="_attribution-p">
+ © 2001–2023 Python Software Foundation<br>Licensed under the PSF License.<br>
+ <a href="https://docs.python.org/3.12/library/urllib.robotparser.html" class="_attribution-link">https://docs.python.org/3.12/library/urllib.robotparser.html</a>
+ </p>
+</div>
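The example in the added file fetches a live `robots.txt` over the network, so its output depends on what that site serves. As a complementary, offline sketch, the same `RobotFileParser` methods can be exercised by passing lines straight to `parse()`. The `ROBOTS_TXT` content, the `example.com` URLs, and the `build_parser` helper below are assumptions made up for illustration and are not part of the commit; `site_maps()` needs Python 3.8 or later.

```python
import urllib.robotparser

# Hypothetical robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 6
Request-rate: 3/20
Disallow: /cgi-bin/
Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text):
    """Hypothetical helper: parse robots.txt text without any network access."""
    rp = urllib.robotparser.RobotFileParser()
    # parse() takes an iterable of lines and also records the fetch time,
    # so can_fetch() and crawl_delay() treat the rules as loaded.
    rp.parse(text.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)

# Paths under /cgi-bin/ are disallowed for every agent; the root is not.
allowed_home = rp.can_fetch("*", "https://example.com/")
allowed_search = rp.can_fetch(
    "*", "https://example.com/cgi-bin/search?city=San+Francisco"
)

delay = rp.crawl_delay("*")    # integer seconds, or None if absent/invalid
rate = rp.request_rate("*")    # RequestRate(requests, seconds), or None
sitemaps = rp.site_maps()      # list of sitemap URLs, or None

print(allowed_home, allowed_search, delay, rate.requests, rate.seconds, sitemaps)
```

Feeding lines to `parse()` rather than calling `set_url()` and `read()` keeps the sketch deterministic, which is convenient for unit tests of crawler logic.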
