Investigate NOINDEXing CopyPatrol
Requested here.

I was surprised to find that the CopyPatrol pages are not NOINDEXed. I noticed this when Google searches for usernames such as "Natstaropoli" and "RMGroup17" returned them. I would appreciate it if these pages did not appear in Google search results, as they could become a target for vandalism.

I blocked Googlebot and the usual other web crawlers in the lighttpd configuration:

$HTTP["useragent"] =~ "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot)" {
  url.access-deny = ( "" )
}

That should work. In the past I haven't had much luck with crawlers honouring robots.txt on Toolforge, even Googlebot.
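
For anyone who wants to check which clients that rule would catch, here is a rough Python sketch. It assumes lighttpd's `=~` does an unanchored, case-sensitive regex search, which Python's `re.search` mirrors for a pattern this simple. The `is_blocked` helper and the sample user agent strings are made up for illustration and are not part of the tool.

```python
import re

# Pattern copied from the lighttpd rule above. It only uses alternation and
# escaped dots, so Python's re module should treat it the same way PCRE does.
BLOCKED = re.compile(
    r"(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|"
    r"Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|"
    r"datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot)"
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the (hypothetical) request would hit url.access-deny."""
    # Unanchored search: the token may appear anywhere in the header value.
    return BLOCKED.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Python-urllib/3.9",
        "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

Note that the match is on the user agent string only, so any script using the default Python-urllib user agent is denied as well, while a client that sends a different user agent is not.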

Awesome! Thanks!

That sucks. :/ Let's see how this goes.

Is there an API for this tool or a CLI client? If so, would blocking Python-urllib cause a problem for them?

No API or CLI client. Python-urllib should normally be OK but with XTools we saw abuse. I just copied the things we were blocking there.

Oh, cool that we already had a set of bad actors. Thanks!

I can't find any bot directive in the header of the page. Perhaps adding a noindex tag would be useful?

Merged! Blocking user agents seems to be effective, but using a meta tag and/or robots.txt is indeed the more proper way. Thanks!
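
For reference, a minimal sketch of the two mechanisms mentioned here, using Python's standard `urllib.robotparser` to show how a well-behaved crawler would read a blanket robots.txt. The robots.txt body, the meta tag string, and the example URL are assumptions for illustration; they are not copied from what was actually merged.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay away from all paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Standard robots meta tag that asks search engines not to index a page.
# It would go inside the <head> of each HTML page.
NOINDEX_META = '<meta name="robots" content="noindex">'


def can_fetch(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.org stands in for the real tool URL.
    print(can_fetch("Googlebot", "https://example.org/en"))
    print(NOINDEX_META)
```

Both approaches rely on the crawler choosing to cooperate, which is why the user agent block in the web server config is still a useful backstop.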