
craft robots.txt to intelligently exclude dynamic pages with cleaner URLs
Closed, Declined · Public

Description

The approved RFC for cleaner URLs of the form /wiki/PageName?action=history proposed a simple

Disallow: /wiki/*?

robots.txt rule. Nemo bis and Dantman point out inadequacies in this (see T14619#1192903, the RFC Talk page, and 2013 mailing list discussion), e.g.

Excluding default "dynamic" pages like the history from crawling makes sense, but reducing the availability of content more than what we do now is unwarranted and a huge cost ...

It would be helpful to identify here all the classes of URLs we do and do not want crawled.
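
To make the trade-offs concrete, one possible starting point would target the dynamic views explicitly rather than every query string. This is only a sketch, assuming Google-style wildcard support in robots.txt; action, diff and oldid are MediaWiki's standard query parameters, and under the cleaner URLs they appear as the first parameter:

User-agent: *
# Dynamic views of pages: edit forms, histories, diffs, old revisions
Disallow: /wiki/*?action=
Disallow: /wiki/*?diff=
Disallow: /wiki/*?oldid=
# Plain page views (/wiki/PageName) remain crawlable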

Event Timeline

Spage raised the priority of this task to Needs Triage.
Spage updated the task description.
Spage added subscribers: Spage, matmarex, He7d3r and 6 others.

I also brought up existing bugs and flaws:
The possibility that we might actually want Allpages to be reasonably indexable (potentially helpful to smaller wikis).

And more importantly, redlinked categories.
Nonexistent categories still have a full category page listing, so in a sense they do exist.
From articles, we link to them using action=edit&redlink=1 URLs, which redirect to the /wiki/Category: page.
But the currently recommended robots.txt blacklists these, so search engines never follow red category links through to their category pages.
These category pages therefore do not get link juice from all the pages that link to them, the way normal category pages do.
And on small installations without sitemaps, search engines may never discover these pages at all.
On small installations this combines with the Allpages issue: other pages in the category that don't receive explicit links from elsewhere may also never be discovered by search engines.

Under the current URL structure, this could be fixed by using /wiki/ URLs for redlink=1 edit links. But if /wiki/*? is blacklisted, this bug will become impossible to fix.
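
For illustration, such a fix could coexist with a query-string blacklist via an explicit exception. This is a sketch assuming Google-style rule precedence, where the most specific (longest) matching rule wins, so the Allow line overrides the broader Disallow:

User-agent: *
# Let crawlers follow red links through to the category page
Allow: /wiki/*?action=edit&redlink=1
# Block all other query-string views
Disallow: /wiki/*?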

And of course this /wiki/*? blacklist basically guarantees we will never be able to index paginated or non-basic content even when we want to.

For example, non-initial pages of categories declare themselves indexable. However, because they currently use /index.php, they are blocked from search engines entirely. That could be fixed by serving them under /wiki/, as sketched below.
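
A similar carve-out could keep category pagination crawlable if those pages were served under /wiki/. Again a sketch assuming longest-match Allow precedence; pagefrom and pageuntil are MediaWiki's category paging parameters:

User-agent: *
# Non-initial category pages, e.g. /wiki/Category:Foo?pagefrom=Bar
Allow: /wiki/*?pagefrom=
Allow: /wiki/*?pageuntil=
Disallow: /wiki/*?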

I don't think this is possible.

jeremyb-phone edited subscribers, added: jeremyb; removed: Aklapper.
jeremyb-phone added a subscriber: Aklapper.