<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article  PUBLIC '-//OASIS//DTD DocBook XML V4.4//EN'  'http://www.docbook.org/xml/4.4/docbookx.dtd'><article><articleinfo><title>HelpOnRobots</title></articleinfo><section><title>Robots</title><para>Robots (also known as crawlers and spiders) are programs that navigate Internet sites and download their content without explicit supervision, typically to build a database for use by search engines, although they may also be engaged in other forms of data mining. Although such programs can serve a useful purpose, they can also place a high load on the sites they visit by requesting large numbers of pages and other resources; they can also expose content from a site that is of little general interest, drowning out the interesting content in the search engine results eventually produced for the site. </para><para><inlinemediaobject><imageobject><imagedata depth="16" fileref="http://ei-www.hyogo-dai.ac.jp/~etsuo/moin_static198/moniker/img/idea.png" width="16"/></imageobject><textobject><phrase>(!)</phrase></textobject></inlinemediaobject> Note that MoinMoin's own search functionality does not depend on having robots access the pages of a wiki. See <ulink url="http://ei-www.hyogo-dai.ac.jp/%7Eetsuo/moin.cgi/HelpOnXapian">HelpOnXapian</ulink> for details of the search engine indexing that can be done internally within MoinMoin. </para><para>MoinMoin controls robots through the following mechanisms: </para><informaltable><tgroup cols="2"><colspec colname="col_0"/><colspec colname="col_1"/><tbody><row rowsep="1"><entry colsep="1" rowsep="1"><para> <emphasis role="strong">Name</emphasis> </para></entry><entry colsep="1" rowsep="1"><para> <emphasis role="strong">Description</emphasis> </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> ua_spiders </para></entry><entry colsep="1" rowsep="1"><para> This <ulink url="http://ei-www.hyogo-dai.ac.jp/%7Eetsuo/moin.cgi/HelpOnConfiguration#spam_leech_dos">configuration setting</ulink> controls access to actions, preventing such programs from visiting things like past page revisions (through the &quot;info&quot; action) or from attempting to change the content of a MoinMoin site in some way. </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> html_head_index </para></entry><entry align="center" colsep="1" morerows="3" nameend="col_1" namest="col_1" rowsep="1"><para> These <ulink url="http://ei-www.hyogo-dai.ac.jp/%7Eetsuo/moin.cgi/HelpOnConfiguration#various">configuration settings</ulink> control the &lt;HEAD&gt; tags that appear in HTML output. Since some robots process &lt;META&gt; tags and observe instructions related to link-following, these settings may be changed to influence how robots navigate a site. </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> html_head_normal </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> html_head_posts </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> html_head_queries </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> robots.txt </para></entry><entry colsep="1" rowsep="1"><para> MoinMoin deploys a collection of static resources inside a directory called <code>htdocs</code>, including a file called <code>robots.txt</code>. By editing this file, a site administrator can instruct robots to access (or stay away from) parts of a site. </para></entry></row></tbody></tgroup></informaltable>
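<para>As a rough sketch of how these settings fit together, the following excerpt from a hypothetical <code>wikiconfig.py</code> sets all of them at once. It assumes the common arrangement in which the configuration class derives from <code>multiconfig.DefaultConfig</code> (as referenced later on this page), and the values shown are examples rather than recommended defaults: </para><programlisting format="linespecific" language="python"><![CDATA[# Hypothetical wikiconfig.py excerpt; adjust names and values to your installation.
from MoinMoin.config import multiconfig

class Config(multiconfig.DefaultConfig):
    # ua_spiders is a regular expression matched against the User-Agent
    # header; clients that match it are treated as robots and are denied
    # access to actions. Here an illustrative robot name is appended.
    ua_spiders = multiconfig.DefaultConfig.ua_spiders + '|examplebot'

    # <meta name="robots" ...> tags emitted in the HTML <HEAD> for the
    # different kinds of request (example values only).
    html_head_index = '<meta name="robots" content="index,follow">\n'
    html_head_normal = '<meta name="robots" content="index,nofollow">\n'
    html_head_posts = '<meta name="robots" content="noindex,nofollow">\n'
    html_head_queries = '<meta name="robots" content="noindex,nofollow">\n']]></programlisting>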
<section><title>Making a site more publicly searchable</title><para>By default, MoinMoin tends to forbid robots, to deny them access to actions, and to instruct them not to follow links if they do end up accessing pages on a wiki. Although a wiki configured in this fashion may still appear in search engine results, mostly because other sites link to its pages, many of its pages will remain unknown to search engines. </para><para>To permit indexing by robots: </para><orderedlist numeration="arabic"><listitem><para>Change the <code>robots.txt</code> file to resemble the following: </para><screen><![CDATA[User-agent: *
Allow: /
Crawl-delay: 20]]></screen><para>This admits any robot (the <code>User-agent</code> line) but asks each one to wait at least 20 seconds between requests (the <code>Crawl-delay</code> line). You can give a narrower <code>User-agent</code> pattern to address only certain robots, and you can add multiple sections describing the access rules for different robots, as in the sketch shown after this list. Make sure that the URL path given for <code>Allow</code> matches the root of the wiki or the part of the wiki that should be indexed. </para></listitem><listitem><para>In your <ulink url="http://ei-www.hyogo-dai.ac.jp/%7Eetsuo/moin.cgi/HelpOnConfiguration">configuration</ulink>, change or add the <code>html_head_normal</code> setting so that it resembles the following: </para><programlisting format="linespecific" language="python"><![CDATA[html_head_normal = '<meta name="robots" content="index,follow">\n']]></programlisting><para>This lets robots know that they should index normal pages and follow the links on them to find other pages. These values can be changed to <code>noindex</code> and <code>nofollow</code> to instruct robots to do the opposite. </para></listitem></orderedlist>
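<para>For instance, a <code>robots.txt</code> along the following lines (the robot name is purely illustrative) admits one well-behaved robot while excluding all others: </para><screen><![CDATA[User-agent: examplebot
Allow: /
Crawl-delay: 20

User-agent: *
Disallow: /]]></screen><para>A robot identifying itself as <code>examplebot</code> matches the first section, while every other robot falls under the catch-all <code>User-agent: *</code> section and is asked to stay away entirely. </para>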
<para>Note that robots are free to ignore these instructions, although doing so is regarded as bad practice, and so the better-known search engine services tend to observe them rather than risk damaging their reputations. </para></section><section><title>Making actions accessible by certain clients</title><para>Some users wish to use programs other than their normal web browser to access wiki content. For example, a calendar client may need to invoke an action in order to download content in a format it understands, and it may identify itself as the <code>curl</code> program when accessing a site; or it may be convenient for some users to use tools like <code>wget</code> to download files stored as attachments. </para><para>To allow certain clients to perform actions, change the <code>ua_spiders</code> setting in your <ulink url="http://ei-www.hyogo-dai.ac.jp/%7Eetsuo/moin.cgi/HelpOnConfiguration">configuration</ulink>. One way of doing this is to redefine the setting as follows: </para><programlisting format="linespecific" language="python"><![CDATA[ua_spiders = multiconfig.DefaultConfig.ua_spiders.replace("|wget", "").replace("|curl", "")]]></programlisting><para>The above assumes that the configuration class is <code>multiconfig.DefaultConfig</code> (check which class your own configuration actually uses); it takes the default setting and removes <code>wget</code> and <code>curl</code> from the regular expression describing robots and clients that are blocked from invoking actions. </para>
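<para>Once those entries are removed, such clients can invoke actions directly. For example, assuming a hypothetical wiki at <code>http://example.org/moin.cgi</code>, commands along these lines fetch the raw text of a page and download a page attachment: </para><screen><![CDATA[# Fetch the raw wiki markup of a page using the "raw" action:
curl 'http://example.org/moin.cgi/FrontPage?action=raw'

# Download an attachment via the AttachFile action:
wget 'http://example.org/moin.cgi/SomePage?action=AttachFile&do=get&target=calendar.ics']]></screen>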
</section></section></article>