
Global robots.txt contains invalid empty line
Closed, Resolved · Public

Description

The global robots.txt loaded from '/srv/mediawiki/robots.txt' by robots.php seems to contain an empty line on row 423:

420 Disallow: /wiki/Wikiquote%3AVotes_for_deletion_archive/
421 Disallow: /wiki/Wikiquote_talk:Votes_for_deletion_archive/
422 Disallow: /wiki/Wikiquote_talk%3AVotes_for_deletion_archive/
423
424 # enwikibooks
425 Disallow: /wiki/Wikibooks:Votes_for_deletion
426 #

This violates the Robots Exclusion Standard (http://www.robotstxt.org/orig.html), which defines the empty line as a separator between records. A bot that adheres strictly to the specification may ignore all directives below that line, since they lack the corresponding User-agent line with which each record must start. Effectively, this could disable every project's custom MediaWiki:Robots.txt, as their contents are appended at the end of the global robots.txt and are supposed to be part of the single large record for 'User-agent: *' that starts on line 147.
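To illustrate the record structure (a minimal, made-up snippet rather than an excerpt from the actual file): everything between one blank line and the next is a single record, and a directive only applies if it shares a record with a User-agent line.

    User-agent: *
    Disallow: /wiki/Special:
    Disallow: /wiki/Wikiquote:Votes_for_deletion_archive/

    # A strict parser treats the blank line above as the end of the
    # 'User-agent: *' record; the directive below now sits in a record
    # with no User-agent line and may simply be ignored.
    Disallow: /wiki/Wikibooks:Votes_for_deletion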

For example, see the robots.txt files of enwiki, bgwiki, and meta (any other project is almost certainly the same):

https://en.wikipedia.org/robots.txt
https://bg.wikipedia.org/robots.txt
https://meta.wikimedia.org/robots.txt

While many bots may in fact ignore this invalid empty line, it's still best to adhere to the specifications as much as possible.

Event Timeline

I see numerous blank lines before line 423 in https://en.wikipedia.org/robots.txt

It's from https://github.com/wikimedia/operations-mediawiki-config/blob/master/robots.txt (well, the gerrit canonical source, but github is easier to link)

> I see numerous blank lines before line 423 in https://en.wikipedia.org/robots.txt

Yes, but they are before line 147 and are valid (and necessary) record separators. Note that each such blank line is followed by a 'User-agent' line, with which a new record starts.

Everything after line 147 is supposed to be one single record that pertains to 'User-agent: *' (that is, to all bots), and therefore must not contain any blank lines.

> It's from https://github.com/wikimedia/operations-mediawiki-config/blob/master/robots.txt (well, the gerrit canonical source, but github is easier to link)

Ah, good to know, thanks. Not sure why I wasn't able to find it online.

Please feel free to submit a change request on gerrit to the operations/mediawiki-config repository :)

> Please feel free to submit a change request on gerrit to the operations/mediawiki-config repository :)

Thanks, that will indeed be a good opportunity to get acquainted with it. :)

Whom should I add as reviewers?

kerberizer triaged this task as Medium priority.

@kerberizer You only need to schedule it at https://wikitech.wikimedia.org/wiki/Deployments and then be available in #wikimedia-operations during the deploy window. Somebody will add themselves as a reviewer automatically; you don't need to add anyone.

Feel free to add me as a reviewer on the robots.txt patch, add it to https://wikitech.wikimedia.org/wiki/Deployments, and I will deploy it on the cluster during the next window I am handling. For robots.txt and just dropping an empty line, your presence is not required (but still welcome).

Bonus points if someone writes a PHPUnit test that validates the robots.txt against the spec. That would prevent stray blank lines (or some other oddity) from being introduced again :}

@hashar, much appreciated indeed! I'll see to push it tomorrow.

As for the unit test, that's indeed a good idea. I'm not much of a programmer, but may eventually also have a look at it if only out of curiosity, and if no one else makes it there first, of course.
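For reference, a minimal sketch of what such a PHPUnit check could look like; this is purely illustrative and not the actual test in operations/mediawiki-config, and the location of robots.txt relative to the test file is an assumption:

    <?php
    // Illustrative sketch only, not the real mediawiki-config test.
    // Assumes robots.txt sits one directory above this test file.
    use PHPUnit\Framework\TestCase;

    class RobotsTxtTest extends TestCase {
        public function testNoBlankLineAfterCatchAllRecordStarts() {
            $lines = file( __DIR__ . '/../robots.txt', FILE_IGNORE_NEW_LINES );
            $this->assertNotFalse( $lines, 'robots.txt must be readable' );

            $inCatchAll = false;
            foreach ( $lines as $i => $line ) {
                if ( preg_match( '/^User-agent:\s*\*\s*$/i', $line ) ) {
                    // The single large 'User-agent: *' record starts here.
                    $inCatchAll = true;
                    continue;
                }
                if ( $inCatchAll ) {
                    // A blank line here would terminate the record and orphan
                    // every directive below it, including the per-wiki
                    // MediaWiki:Robots.txt additions appended at the end.
                    $this->assertNotSame( '', trim( $line ),
                        'Unexpected blank line at row ' . ( $i + 1 ) );
                }
            }
        }
    }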

Change 313763 had a related patch set uploaded (by Kerberizer):
Fix an invalid empty line in the global robots.txt

https://gerrit.wikimedia.org/r/313763

Added to today's European SWAT window.

@hashar, thanks. I had already added it to the Morning SWAT, but if the EU mid-day SWAT is more convenient for you, that's totally fine by me (I hadn't added it there myself, as all eight slots seemed taken). I think I'll also be available on IRC if the need arises.

Change 313763 merged by jenkins-bot:
Fix an invalid empty line in the global robots.txt

https://gerrit.wikimedia.org/r/313763

Mentioned in SAL (#wikimedia-operations) [2016-10-03T14:28:26Z] <zfilipin@tin> Synchronized robots.txt: SWAT: [[gerrit:313763|Fix an invalid empty line in the global robots.txt (T146908)]] (duration: 00m 47s)