
Investigate NOINDEXing CopyPatrol
Closed, ResolvedPublic

Description

Requested here.

I was surprised to find the CopyPatrol pages are not NOINDEXed. Google searching usernames such as "Natstaropoli" and "RMGroup17" revealed this fact. I would appreciate it if this page did not appear in Google search results, as it could become a target for vandalism.

Event Timeline

Niharika created this task.
Restricted Application added a subscriber: Aklapper.

I blocked Googlebot and the usual other web crawlers in the lighttpd configuration:

$HTTP["useragent"] =~ "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot)" {
  url.access-deny = ( "" )
}

That should work. In the past I haven't had much luck with crawlers honouring robots.txt on Toolforge, even Googlebot.


Awesome! Thanks!

> That should work. In the past I haven't had much luck with crawlers honouring robots.txt on Toolforge, even Googlebot.

That sucks. :/ Let's see how this goes.
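
For reference (not part of the original thread), the robots.txt approach being discussed would be a file like the following at the tool's web root; as noted above, crawlers on Toolforge haven't reliably honoured it:

# Ask all crawlers not to index any CopyPatrol page
User-agent: *
Disallow: /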

Is there an API for this tool or a CLI client? If so, would blocking Python-urllib cause a problem for them?

No API or CLI client. Python-urllib should normally be OK, but with XTools we saw abuse from it, so I just copied the list of user agents we were blocking there.

Oh, cool that we already had a set of bad actors. Thanks!
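
Side note, not from the thread: "Python-urllib" is only the default User-Agent that Python's urllib sends when a script doesn't set one, so a legitimate client can avoid the block simply by identifying itself. A minimal sketch, assuming the tool stays at its current URL:

import urllib.request

# Hypothetical client sketch: send an explicit User-Agent so the request is not
# identified as the default "Python-urllib/X.Y", which the rule above blocks.
url = "https://tools.wmflabs.org/copypatrol/fr"
req = urllib.request.Request(
    url,
    headers={"User-Agent": "ExampleResearchTool/1.0 (contact@example.org)"},
)
with urllib.request.urlopen(req) as response:
    html = response.read().decode("utf-8")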

Niharika claimed this task.
Niharika moved this task from Backlog to Done on the CopyPatrol board.

I can't find any bot directive in the header of the page https://tools.wmflabs.org/copypatrol/fr. Perhaps adding a noindex tag would be useful?

Merged! Blocking user agents seems to be effective, but using a meta tag and/or robots.txt is indeed the more proper way. Thanks!
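
For anyone reading later: a noindex directive could also be sent from the existing lighttpd configuration as an X-Robots-Tag response header. A sketch, assuming mod_setenv is enabled:

# Ask any crawler that does fetch a page not to index it; the HTML alternative
# is a <meta name="robots" content="noindex, nofollow"> tag in the page templates.
setenv.add-response-header = ( "X-Robots-Tag" => "noindex, nofollow" )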