Scripts loaded via importScript() are blocked from Googlebot
Closed, Invalid, Public

Description

(original report at http://comments.gmane.org/gmane.science.linguistics.wikipedia.technical/83171 )

enwiki splits off the file-related parts of Common.js to a subpage and imports it for every page view in the File: namespace via importScript. (There are other examples of high-impact importScript usage; this seemed to be the most prevalent.)

importScript loads pages by appending a script with src <wiki domain>/w/index.php?title=<page>&action=raw&ctype=text/javascript to the head. Besides not being great performance-wise, this leads to the Google Webmaster Tools being spammed by reports of Googlebot being blocked, as these URLs match the robots.txt ban for /w/.
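To make the mismatch concrete, here is a minimal Python sketch of the URL shape described above, checked against a robots.txt that contains only the /w/ ban. The helper names and the two-line robots.txt are assumptions for illustration; the real robots.txt has many more rules, and Python's urllib.robotparser does plain prefix matching, which is not necessarily how Googlebot evaluates rules.

```python
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser

# Assumed, heavily reduced stand-in for the real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""


def import_script_url(domain: str, page: str) -> str:
    """Build a URL of the shape importScript is described as requesting."""
    query = urlencode(
        {"title": page, "action": "raw", "ctype": "text/javascript"},
        safe=":/",  # keep ':' and '/' readable, as in the URLs quoted in this task
    )
    return f"https://{domain}/w/index.php?{query}"


def is_blocked(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the reduced robots.txt above forbids fetching the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(agent, url)


script_url = import_script_url("en.wikipedia.org", "MediaWiki:Common.js/file.js")
print(script_url, "blocked:", is_blocked(script_url))
```

Under these assumptions the script URL falls under the /w/ prefix and is reported as blocked, while an ordinary /wiki/ page URL is not.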

There are two ways of addressing this:

  • amend robots.txt to allow index.php with action=raw (seems kind of painful, regular expressions are not great for parsing URL query strings)
  • turn some of the code in Common.js into a gadget so it can take advantage of the ResourceLoader infrastructure.
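A rough Python sketch of why the first option is awkward: a wildcard-style pattern sees the URL as a flat string, while deciding "is this action=raw?" really needs the query string parsed. The pattern /w/index.php?*action=raw and both function names are hypothetical, and the '*' and trailing '$' handling is a simplified model of wildcard matching in robots.txt, not an exact reimplementation of any crawler.

```python
import re
from urllib.parse import parse_qs, urlsplit


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt-style pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches_pattern(path_and_query: str, pattern: str) -> bool:
    """String-level match, which is all a robots.txt rule can express."""
    return wildcard_to_regex(pattern).match(path_and_query) is not None


def is_raw_action(path_and_query: str) -> bool:
    """What we actually mean: index.php called with the parameter action=raw."""
    parts = urlsplit(path_and_query)
    return parts.path == "/w/index.php" and parse_qs(parts.query).get("action") == ["raw"]


PATTERN = "/w/index.php?*action=raw"  # hypothetical Allow rule
wanted = "/w/index.php?title=MediaWiki:Common.js/file.js&action=raw&ctype=text/javascript"
lookalike = "/w/index.php?title=X&returnto=Y&oldaction=rawdata"

for url in (wanted, lookalike):
    print(url, matches_pattern(url, PATTERN), is_raw_action(url))
```

The lookalike URL has no action=raw parameter at all, yet the string pattern still matches it, so such a rule would allow more than intended.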

Event Timeline

Tgr raised the priority of this task to Needs Triage.
Tgr updated the task description.
Tgr added projects: MediaWiki-General, JavaScript.
Tgr subscribed.

@Anomie pointed out in an email thread that the ban on /w/ in general is to prevent indexing of action=history among other things. Tangentially that's something we will need to address in another way if T14619: Use article path URLs for editing, previewing skins, etc. is ever implemented.

Right now https://en.wikipedia.org/wiki/MediaWiki:Common.js/file.js?action=raw&ctype=text/javascript returns a 403 so a caveman fix of replacing /w/index.php?action= with /wiki/ directly apparently won't work.

It would probably mess up pageview analytics as well.

We could use something like /w/index.php?s&action= and then whitelist on that. It's a nasty hack and doesn't address the performance aspects of using importScript though.
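A small Python sketch of how that marker idea could interact with the existing ban, assuming the crawler picks the longest matching rule and prefers Allow on a tie. The rule list is hypothetical and reduced to two entries, and only plain prefix matching is modelled.

```python
# Hypothetical, reduced rule set: the existing /w/ ban plus a whitelist entry
# keyed on the "?s&action=raw" marker prefix suggested above.
RULES = [
    ("disallow", "/w/"),
    ("allow", "/w/index.php?s&action=raw"),
]


def is_allowed(path_and_query: str, rules=RULES) -> bool:
    """Longest matching prefix wins; on equal length, Allow wins."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, prefix in rules:
        if not path_and_query.startswith(prefix):
            continue
        allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, best_allow = len(prefix), allow
    return best_allow


marked = "/w/index.php?s&action=raw&title=MediaWiki:Common.js/file.js&ctype=text/javascript"
history = "/w/index.php?title=Foo&action=history"

print(is_allowed(marked), is_allowed(history))
```

Because the whitelist entry is a fixed prefix, it only works if every script URL is generated with the marker and action=raw in exactly that position, which is what makes this a hack.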

Adam/Jon/Wes -- I feel like this lines up with the SEO work that we've been doing. Should we try to connect these and make a larger project out of them?

-Toby

The exact url formation for import scripts is not easily changed. Firstly, it is hardcoded in many places. Changing this would be non-trivial. Secondly, MediaWiki depends on this exact format to ensure squid (varnish) purges are sent when the underlying wikipage is modified.

I'd like to return to the original issue and challenge it. What is the actual impact of this warning? Don't let robot tools dictate our terms. They're just tools to help discover potential issues. They are not themselves proof of any problem. Much like how messages from the W3C Validator carry very little value in practice since we develop for real-world web browsers, not validators. (And the reality is that browsers can themselves be "invalid".)

Assuming that the only impact is that Google won't fetch these scripts, I'd say that's absolutely fine. Perhaps even preferable. It just means that when Google simulates a page rendering for its search index, it will not consider any modifications these scripts may make to the page.

That's fine since these scripts provide user interaction, not content. Some websites nowadays render their pages client-side, but MediaWiki does not. We render our pages server-side. This is the industry best practice. Even for client-side apps it is recommended to pre-render the first page view server-side for improved performance.

Note that until a few years ago Google did not even try to fetch scripts. It now attempts a minimal simulated rendering to accommodate those pure-JavaScript websites. It shouldn't affect us.

+1 on Krinkle. This report lacks a problem statement.

w:en:MediaWiki:Common.js/file.js has been removed by @Edokter .

Are there other cases worth investigating?

Common.js still loads edit.js and watchlist.js through importScript; edit.js because of a seemingly longstanding bug T10912, and extra buttons for the old edit toolbar; and watchlist.js to add buttons to dismiss watchlist messages.

Is the old toolbar even still worth keeping? And do we have some replacement for the dismiss button by any chance? If not, I'll investigate moving these separate scripts to gadgets.

I am guessing those two are skipped by Googlebot, as it doesn't load URLs where they would be activated.

Commons, for example, uses importScript extensively. I'm sure there are many other wikis that do.

  • The performance concerns mentioned regarding loading of action=raw are mostly invalid nowadays (aside from lacking minification). In all other layers (MediaWiki, Varnish, purging etc.) they perform as well or better than page views.
  • Warnings in GWT don't affect anything. Similar to W3C validation errors. They're good suggestions, but shouldn't be pursued as a goal of their own or used as a measure.
  • The scripts in question don't seem to be significant w.r.t. page content or Google indexing thereof.
  • The scripts in question have mostly been replaced with in-place modifications of Common.js (for improved performance) or replaced or removed in favour of CSS or gadgets.