
Need robots.txt policy for XHGui
Closed, ResolvedPublic

Description

Files under https://performance.wikimedia.beta.wmflabs.org/xhgui should definitely be disallowed. When I was testing the new XHGui today I noticed a lot of crawler traffic hitting old profile IDs.

Files under https://performance.wikimedia.org/xhgui should probably be disallowed, too, unless there's something I'm missing.

Arguably, we should disallow everything on the beta site, as there's no reason for crawlers to be hitting it instead of the main site. To do so, we'd need a mechanism for serving a different robots.txt based on realm, e.g. an alias in the httpd configuration.
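If we did disallow everything on Beta, the realm-specific file would presumably be a deny-all robots.txt along these lines (a sketch; the actual filename and contents are up to the eventual patch):

```
User-agent: *
Disallow: /
```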

Event Timeline

Perhaps we can upstream this? I think they'd be open to simply setting robots noindex,nofollow on everything, except maybe the landing page, which could be nofollow,index or some such. That'd be fine for Beta as well, I think?
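For reference, that blanket policy could be expressed either as a meta tag in each page template, or, without touching XHGui's templates at all, as an HTTP response header set by Apache. A sketch of the header approach, assuming mod_headers is enabled and /xhgui is the mount path:

```
# Hypothetical httpd snippet: tell crawlers not to index or follow
# anything served under /xhgui, regardless of page content.
<Location "/xhgui">
    Header set X-Robots-Tag "noindex, nofollow"
</Location>
```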

As a short-term measure we could have Puppet provision a robots.txt file on top, but if we let XHGui handle it, we might not need to.

Also, since this is currently mounted at a subdirectory, we can't easily have XHGui's Puppet provisioning take care of this. The robots.txt file is, in our case, owned by the static perf site.

noindex,nofollow is a good idea. Is there any reason why we would want anything under /xhgui crawled?

I was planning to add something like this to performance-website.erb in Puppet:

<% if @realm == "labs" %>
Alias /robots.txt /srv/org/wikimedia/performance/public_html/robots-beta.txt
<% end -%>

...plus the relevant file in the static site repo. Plus probably a <link rel="canonical"> tag in index.html.
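The canonical tag mentioned here would just be a one-liner in the beta site's index.html; a sketch, assuming the prod URL were the intended canonical target (note the idea is argued against below):

```
<link rel="canonical" href="https://performance.wikimedia.org/">
```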

> noindex,nofollow is a good idea. Is there any reason why we would want anything under /xhgui crawled?

Nope, not that I can think of. Neither prod nor beta.

> I was planning to add something like this to performance-website.erb in Puppet:
>
> <% if @realm == "labs" %>
> Alias /robots.txt /srv/org/wikimedia/performance/public_html/robots-beta.txt
> <% end -%>
>
> ...plus the relevant file in the static site repo. Plus probably a <link rel="canonical"> tag in index.html.

That seems fine yeah, although we could also put it in the static repo as-is in general more widely. I don't think it needs to be Beta-specific.

Regarding canonical, I think it'd be better to treat these as separate installs to the public and not indicate one as being canonical for the other.

Change 608961 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/docroot@master] Add Disallow: /xhgui to robots.txt

https://gerrit.wikimedia.org/r/c/performance/docroot/+/608961
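Per the patch subject, the docroot change presumably amounts to a robots.txt stanza like the following (a sketch; the exact contents are in the linked change):

```
User-agent: *
Disallow: /xhgui
```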

Change 608962 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] webperf: Serve different robots.txt on beta site

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608962

Change 608961 merged by jenkins-bot:
[performance/docroot@master] Add Disallow: /xhgui to robots.txt

https://gerrit.wikimedia.org/r/c/performance/docroot/+/608961

Change 608962 merged by Dzahn:
[operations/puppet@production] webperf: Serve different robots.txt on beta site

https://gerrit.wikimedia.org/r/608962

Krinkle triaged this task as Medium priority.
Krinkle updated the task description. (Show Details)