
Investigate NOINDEXing CopyPatrol
Closed, ResolvedPublic

Description

Requested here.

I was surprised to find the CopyPatrol pages are not NOINDEXed. Google searching usernames such as "Natstaropoli" and "RMGroup17" revealed this fact. I would appreciate it if this page did not appear in Google search results, as it could become a target for vandalism.

Event Timeline

Niharika triaged this task as Low priority.Sep 14 2018, 5:17 PM
Niharika created this task.
Restricted Application added a project: Community-Tech. Sep 14 2018, 5:17 PM
Restricted Application added a subscriber: Aklapper.
MusikAnimal added a subscriber: MusikAnimal.EditedSep 14 2018, 5:32 PM

I blocked Googlebot and the usual other web crawlers in the lighttpd configuration:

$HTTP["useragent"] =~ "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot)" {
  url.access-deny = ( "" )
}

That should work. In the past I haven't had much luck with crawlers honouring robots.txt on Toolforge, even Googlebot.
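For reference, the alternation in that rule can be sanity-checked outside lighttpd. A minimal sketch in Python, assuming lighttpd's `=~` performs an unanchored, case-sensitive regex search (the `is_blocked` helper is illustrative, not part of the tool):

```python
import re

# The same alternation used in the lighttpd $HTTP["useragent"] =~ condition.
BLOCKED = re.compile(
    r"(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider"
    r"|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot"
    r"|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp"
    r"|Python-urllib|BehloolBot|MJ12bot)"
)

def is_blocked(user_agent: str) -> bool:
    """Return True if this user agent would be denied by the rule."""
    return BLOCKED.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/62.0"))  # False
```

Note the match is case-sensitive, so a crawler sending a differently-cased name would slip through unless the condition is adjusted.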

> I blocked Googlebot and the usual other web crawlers in the lighttpd configuration:
>
> $HTTP["useragent"] =~ "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot)" {
>   url.access-deny = ( "" )
> }

Awesome! Thanks!

> That should work. In the past I haven't had much luck with crawlers honouring robots.txt on Toolforge, even Googlebot.

That sucks. :/ Let's see how this goes.

aezell added a subscriber: aezell.Sep 14 2018, 5:38 PM

Is there an API for this tool or a CLI client? If so, would blocking Python-urllib cause a problem for them?

MusikAnimal added a comment.EditedSep 14 2018, 5:44 PM

No API or CLI client. Python-urllib should normally be OK, but with XTools we saw abuse from it; I just copied the list of agents we were blocking there.
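For anyone who does script against the site, the rule only matches the literal string `Python-urllib`, so a well-behaved script can identify itself with its own descriptive User-Agent instead of the library default. A hypothetical sketch (the bot name and contact address are made up):

```python
from urllib.request import Request, urlopen

# Hypothetical client: override urllib's default "Python-urllib/x.y"
# User-Agent with a descriptive one so the lighttpd deny rule doesn't match.
req = Request(
    "https://tools.wmflabs.org/copypatrol/en",
    headers={"User-Agent": "ExampleResearchBot/1.0 (contact: someone@example.org)"},
)

# Uncomment to actually fetch the page:
# with urlopen(req) as resp:
#     html = resp.read()
```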

Oh, cool that we already had a set of bad actors. Thanks!

Niharika closed this task as Resolved.Sep 18 2018, 5:26 PM
Niharika claimed this task.
Niharika moved this task from Backlog to Done on the CopyPatrol board.
Framawiki added a subscriber: Framawiki.EditedSep 30 2018, 9:27 AM

I can't find any bot directive in the header of the page https://tools.wmflabs.org/copypatrol/fr. Perhaps adding a noindex tag would be useful?
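For the record, the directive being suggested is the standard robots meta element, placed in each page's `<head>`; a minimal sketch:

```html
<!-- In the <head> of every CopyPatrol page -->
<meta name="robots" content="noindex, nofollow">
```

Unlike the user-agent block, this tells compliant crawlers not to index the page even if they can fetch it.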

MusikAnimal closed this task as Resolved.Oct 1 2018, 2:11 AM

Merged! Blocking user agents seems to be effective, but a meta tag and/or robots.txt is indeed the more proper way. Thanks!
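A robots.txt served from the tool root would be the complementary piece; a minimal sketch, assuming crawlers honour it (which, per the comments above, Googlebot has not always done on Toolforge):

```
User-agent: *
Disallow: /
```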