
www.wikipedia.org/robots.txt should not be a redirect
Open, Medium, Public

Description

From Google PageSpeed Insights (as used by Google Search Console):

cap.png (844×2 px, 77 KB)

Lighthouse "SEO" audits (90/100):

  • robots.txt is not valid. Lighthouse was unable to download a robots.txt file

I can reproduce this locally in the Lighthouse audit via Chrome DevTools. This is probably because the URL is effectively a redirect to en.wikipedia.org/robots.txt. I say effectively because the request is trapped by a catch-all rule in Apache that sends all unknown and would-be 404 URLs to en.wikipedia.org first.
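
For illustration, a catch-all of roughly the following shape would produce the observed behaviour. This is a hypothetical sketch, not the actual rule from the operations/puppet repository, which may be structured differently:

  # Hypothetical sketch of the catch-all described above: any request to
  # www.wikipedia.org that no earlier rule matched is redirected to the
  # same path on en.wikipedia.org, so /robots.txt gets a 301, not content.
  RewriteEngine On
  RewriteRule ^/(.*)$ https://en.wikipedia.org/$1 [R=301,L]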

According to the Internet Archive, this regression happened between 6 and 11 June 2008:

cap.png (1×2 px, 267 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.

Hi @Krinkle. I would like to work on this. Can you provide a bit of insight into what I should change? As far as I can see, the URL is opening robots.txt. I'd also appreciate it if you could send the repo link. Thanks!

Hey! You can find the link to the repository in the project description - Wikimedia-Portals :)

Hi, could you please elaborate on what is specifically unclear about the task description?
The repo itself does not have a robots.txt file, and any non-existing URL gets redirected; try e.g. https://www.wikipedia.org/whatever .

@Aklapper That was exactly my question: whether something needed to be added to an existing file or a new file had to be created. Thanks for the clarification!

I am guessing Wikimedia wants the site to be accessible to all of the common user agents. Are there any specific user agents you want to block, @Iniquity?
CC: @Aklapper

This task is not asking for new robots rules to be created or modified. The existing content should stay exactly as-is. No changes or additions are needed in the portal Git repository.

This task is about the Apache configuration for www.wikipedia.org, which is in the Puppet Git repository.

Specifically, to ensure it serves robots.txt (as it already does today) without a redirect. If that is not possible with rewrites or bypasses, then we should ask Reading-Web which subset of rules we want to hardcode and duplicate for the portals. I imagine only a very narrow subset of what we have today would be relevant, as most of the enwiki robots.txt rules wouldn't do much for the portal page.
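
For example, one possible approach (a sketch only; the exact rule placement and file layout are assumptions, not the actual production config) is to exempt /robots.txt from the catch-all so Apache serves the file directly:

  # Sketch: let /robots.txt fall through to the portal document root
  # instead of being caught by the redirect-everything-else rule.
  RewriteEngine On
  RewriteCond %{REQUEST_URI} !^/robots\.txt$
  RewriteRule ^/(.*)$ https://en.wikipedia.org/$1 [R=301,L]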

Change 662761 had a related patch set uploaded (by SarthakKundra; owner: SarthakKundra):
[wikimedia/portals@master] Created robot.txt

https://gerrit.wikimedia.org/r/662761

Here's a Lighthouse CI score from a run on the dev server. Let me know if any changes are required :)

Screenshot 2021-02-09 at 12.36.14 AM.png (1×2 px, 362 KB)

Can you tell me where I can find the puppet Git repository? Is it the one on Github?

Change 662761 abandoned by Aklapper:
[wikimedia/portals@master] Created robot.txt

Reason:
We should not add a robot.txt file; instead the Apache config in the operations/puppet repository needs changes

https://gerrit.wikimedia.org/r/662761

In T242500#6812314, Krinkle wrote:

This task is about the Apache configuration for www.wikipedia.org, which is in the Puppet Git repository.

operations/puppet in Gerrit; see https://www.mediawiki.org/wiki/New_Developers. (Also note that this task isn't marked as a good first task :)

@Jdrewniak Is this task somehow related to the past patch titled "Serve a default /robots.txt on a 404 from the backend"?

@Techwizzie I've removed this task from the GSoC portals project (I probably shouldn't have added it there in the first place). The Apache configuration is something that affects all of our production sites and should be handled by someone with intimate knowledge of how it works; therefore, I don't recommend changing it.