
Implement a PHPUnit test to validate robots.txt against the specs
Open, Lowest, Public

Description

As discussed in T146908, it would be helpful to have a PHPUnit test that validates the global robots.txt file against the existing specifications. All projects rely on this global robots.txt being correct, so that their own MediaWiki:Robots.txt pages (which are appended per project to the global file) are also honored by the web crawlers and perform as expected.

There is a robots.txt PHP parser class on GitHub that may be helpful in implementing the unit test, at least as inspiration for how a robots.txt file can be parsed in PHP. The project's wiki may also be of particular interest, as it has a nice summary of the relevant specifications.

Unlike this parser library, the unit test most likely doesn't need to understand and implement the more intricate details of the specifications; it could instead focus only on what is already used in the existing robots.txt (e.g. fail if empty lines exist within a record, which was the specific problem behind T146908). In other words, false positives caused by some obscure but valid syntax are probably fine, as long as there are no false negatives. This may make the task substantially easier.
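A minimal sketch of such a focused check, as a starting point. The `RobotsTxtChecker` class, the test name, and the file path are assumptions, not existing code; it deliberately validates only record structure, not the full specification:

```php
<?php
// A minimal sketch, not a finished implementation. It checks only the
// failure mode from T146908: a rule line (Disallow/Allow/...) that is no
// longer attached to a User-agent line because a blank line split the
// record. Class names and the file path are assumptions.

use PHPUnit\Framework\TestCase;

class RobotsTxtChecker {
	/**
	 * Return "lineNo: text" entries for rule lines that are not part of
	 * a record opened by a User-agent line.
	 */
	public static function findOrphanRuleLines( string $body ): array {
		$orphans = [];
		$inRecord = false;
		foreach ( preg_split( '/\r?\n/', $body ) as $i => $line ) {
			$trimmed = trim( $line );
			if ( $trimmed === '' ) {
				// A blank line terminates the current record.
				$inRecord = false;
			} elseif ( $trimmed[0] === '#' ) {
				// Comments don't affect record structure.
			} elseif ( stripos( $trimmed, 'User-agent:' ) === 0 ) {
				$inRecord = true;
			} elseif ( !$inRecord ) {
				// Rule line (Disallow, Allow, ...) with no open record.
				$orphans[] = ( $i + 1 ) . ': ' . $trimmed;
			}
		}
		return $orphans;
	}
}

class RobotsTxtValidationTest extends TestCase {
	public function testRecordsAreNotSplitByBlankLines() {
		// Hypothetical location of the global robots.txt in the repo.
		$body = file_get_contents( __DIR__ . '/../robots.txt' );
		$this->assertNotFalse( $body, 'Could not read robots.txt' );
		$this->assertSame(
			[],
			RobotsTxtChecker::findOrphanRuleLines( $body ),
			'Rule lines detached from their User-agent line'
		);
	}
}
```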

It may be nice to also cover cases like this old bug report, where the problem was not with the source robots.txt per se, but rather with how robots.php was handling it.
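For that kind of regression, the same structural check could be run over what robots.php actually serves, rather than only over the source file it reads. A hedged sketch, where the URL is illustrative and `RobotsTxtChecker` is the hypothetical helper from the sketch above:

```php
<?php
// A rough sketch only: validate the assembled output that robots.php
// serves (the global file plus the wiki's MediaWiki:Robots.txt), reusing
// the hypothetical RobotsTxtChecker from the previous sketch. The URL is
// illustrative; a real test might iterate over several wikis.

use PHPUnit\Framework\TestCase;

class ServedRobotsTxtTest extends TestCase {
	public function testServedOutputHasWellFormedRecords() {
		$body = file_get_contents( 'https://en.wikipedia.org/robots.txt' );
		$this->assertNotFalse( $body, 'Could not fetch the served robots.txt' );
		$this->assertSame(
			[],
			RobotsTxtChecker::findOrphanRuleLines( $body ),
			'robots.php output contains detached rule lines'
		);
	}
}
```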

Event Timeline

kerberizer triaged this task as Lowest priority. Oct 3 2016, 3:07 AM

@hashar, could you please add anyone who might be interested in keeping track of this task or, better yet, who might want to claim it? Unfortunately, I don't know the devs. As for myself, while I'd very much like to help, I'm afraid it isn't likely to happen very soon, considering that I do very little programming overall and have almost zero experience with PHP. Still, I'll keep an eye on it. Thanks! :)

It might be good to have some validation of the MediaWiki:Robots.txt pages as well; a rough sketch follows.
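This sketch assumes the per-wiki pages can be fetched as plain text via action=raw; the wiki list and URL pattern are illustrative, and `RobotsTxtChecker` is again the hypothetical helper from the first sketch:

```php
<?php
// A rough sketch, assuming the per-wiki MediaWiki:Robots.txt page can be
// fetched as plain text via action=raw. The wiki list and URL pattern are
// illustrative; RobotsTxtChecker is the hypothetical shared helper.

use PHPUnit\Framework\TestCase;

class PerWikiRobotsTxtTest extends TestCase {
	public function testPerWikiPagesHaveWellFormedRecords() {
		$wikis = [ 'en.wikipedia.org', 'de.wikipedia.org' ]; // illustrative
		foreach ( $wikis as $wiki ) {
			$url = "https://$wiki/w/index.php?title=MediaWiki:Robots.txt&action=raw";
			$body = @file_get_contents( $url );
			if ( $body === false ) {
				continue; // the page may not exist on every wiki
			}
			$this->assertSame(
				[],
				RobotsTxtChecker::findOrphanRuleLines( $body ),
				"MediaWiki:Robots.txt on $wiki contains detached rule lines"
			);
		}
	}
}
```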