
Investigate NOINDEXing CopyPatrol
Closed, ResolvedPublic

Description

Requested here.

I was surprised to find the CopyPatrol pages are not NOINDEXed. Google searching usernames such as "Natstaropoli" and "RMGroup17" revealed this fact. I would appreciate it if this page did not appear in Google search results, as it could become a target for vandalism.

Event Timeline

Niharika created this task.
Restricted Application added a subscriber: Aklapper.

I blocked Googlebot and the usual other web crawlers in the lighttpd configuration:

$HTTP["useragent"] =~ "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot)" {
  url.access-deny = ( "" )
}

That should work. In the past I haven't had much luck with crawlers honouring robots.txt on Toolforge, even Googlebot.


Awesome! Thanks!

> That should work. In the past I haven't had much luck with crawlers honouring robots.txt on Toolforge, even Googlebot.

That sucks. :/ Let's see how this goes.
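
For reference (not part of the original thread), the robots.txt approach being discussed would be a file like the following at the tool's web root; as noted above, crawlers on Toolforge haven't reliably honoured it:

# Ask all crawlers not to index any CopyPatrol page
User-agent: *
Disallow: /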

Is there an API for this tool or a CLI client? If so, would blocking Python-urllib cause a problem for them?

No API or CLI client. Python-urllib should normally be OK, but with XTools we saw abuse from it, so I just copied the list of user agents we were blocking there.

Oh, cool that we already had a set of bad actors. Thanks!
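
Side note, not from the thread: "Python-urllib" is only the default User-Agent that Python's urllib sends when a script doesn't set one, so a legitimate client can avoid the block simply by identifying itself. A minimal sketch, assuming the tool stays at its current URL:

import urllib.request

# Hypothetical client sketch: send an explicit User-Agent so the request is not
# identified as the default "Python-urllib/X.Y", which the rule above blocks.
url = "https://tools.wmflabs.org/copypatrol/fr"
req = urllib.request.Request(
    url,
    headers={"User-Agent": "ExampleResearchTool/1.0 (contact@example.org)"},
)
with urllib.request.urlopen(req) as response:
    html = response.read().decode("utf-8")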

Niharika claimed this task.
Niharika moved this task from Backlog to Done on the CopyPatrol board.

I can't find any bot directive in the header of the page https://tools.wmflabs.org/copypatrol/fr. Perhaps adding a noindex tag would be useful?

Merged! Blocking user agents seems to be effective, but using a meta tag and/or robots.txt is indeed the more proper way. Thanks!
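
For anyone reading later: a noindex directive could also be sent from the existing lighttpd configuration as an X-Robots-Tag response header. A sketch, assuming mod_setenv is enabled:

# Ask any crawler that does fetch a page not to index it; the HTML alternative
# is a <meta name="robots" content="noindex, nofollow"> tag in the page templates.
setenv.add-response-header = ( "X-Robots-Tag" => "noindex, nofollow" )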