
craft robots.txt to intelligently exclude dynamic pages with cleaner URLs
Closed, Declined · Public

Description

The approved RFC for cleaner URLs of the form /wiki/PageName?action=history proposed a simple

Disallow: /wiki/*?

robots.txt rule. Nemo bis and Dantman point out inadequacies in this (see T14619#1192903, the RFC Talk page, and 2013 mailing list discussion), e.g.

Excluding default "dynamic" pages like the history from crawling makes sense, but reducing the availability of content more than what we do now is unwarranted and a huge cost ...

It would be helpful to identify here all the classes of URLs we do and do not want crawled.
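
To make the trade-offs concrete, one possible starting point would target the dynamic views explicitly rather than every query string. This is only a sketch, assuming Google-style wildcard support in robots.txt; action, diff and oldid are MediaWiki's standard query parameters, and under the cleaner URLs they appear as the first parameter:

User-agent: *
# Dynamic views of pages: edit forms, histories, diffs, old revisions
Disallow: /wiki/*?action=
Disallow: /wiki/*?diff=
Disallow: /wiki/*?oldid=
# Plain page views (/wiki/PageName) remain crawlable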

Event Timeline

Spage raised the priority of this task to Needs Triage.
Spage updated the task description.
Spage added subscribers: Spage, matmarex, He7d3r and 6 others.

I also brought up existing bugs and flaws:
The possibility that we might actually want Allpages to be reasonably indexable (potentially helpful to smaller wikis).

And more importantly, redlinked categories.
Nonexistent categories still have a full category page listing, so in a sense they do exist.
From articles, we link to them using action=edit&redlink=1 URLs, which redirect to the /wiki/Category: page.
But the currently recommended robots.txt blacklists these, so search engines never follow red category links through to their category pages.
These category pages therefore do not get link juice from all the pages that link to them, the way normal category pages do.
And on small installations without sitemaps, search engines may never discover these pages at all.
On small installations this combines with the Allpages issue: other pages in the category that don't receive explicit links from elsewhere may also never be discovered by search engines.

Under the current URL structure, this could be fixed by using /wiki/ URLs for redlink=1 edit links. But if /wiki/*? is blacklisted, this bug will become impossible to fix.
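
For illustration, such a fix could coexist with a query-string blacklist via an explicit exception. This is a sketch assuming Google-style rule precedence, where the most specific (longest) matching rule wins, so the Allow line overrides the broader Disallow:

User-agent: *
# Let crawlers follow red links through to the category page
Allow: /wiki/*?action=edit&redlink=1
# Block all other query-string views
Disallow: /wiki/*?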

And of course this /wiki/*? blacklist basically guarantees we will never be able to index paginated or non-basic content even when we want to.

For example, non-initial pages of categories declare themselves indexable. However, because they currently use /index.php, they are blocked from search engines entirely. That could be fixed by serving them under /wiki/, as sketched below.
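
A similar carve-out could keep category pagination crawlable if those pages were served under /wiki/. Again a sketch assuming longest-match Allow precedence; pagefrom and pageuntil are MediaWiki's category paging parameters:

User-agent: *
# Non-initial category pages, e.g. /wiki/Category:Foo?pagefrom=Bar
Allow: /wiki/*?pagefrom=
Allow: /wiki/*?pageuntil=
Disallow: /wiki/*?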

I don't think this is possible.

jeremyb-phone edited subscribers, added: jeremyb; removed: Aklapper.
jeremyb-phone added a subscriber: Aklapper.