https://www.wikipedia.org/ portal doesn't have any text
Closed, Resolved (Public)

Description

https://www.wikipedia.org/ portal doesn't have any text right now. The search form works, though.

The portal page is trying to load JavaScript from https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-d1cc91a7f4.js, which is redirecting to a 404 error page right now.

Paladox triaged this task as High priority.Feb 22 2017, 5:17 PM
Paladox raised the priority of this task from High to Unbreak Now!.
Paladox added a subscriber: Paladox.
Restricted Application added subscribers: Jay8g, TerraCodes.Feb 22 2017, 5:17 PM

I get this error in the developer tools

[Error] ReferenceError: Can't find variable: doWhenReady
(anonymous function) (gt-ie9-c84bf66d33.js:1)
Global Code (gt-ie9-c84bf66d33.js:1:634)

ema added a subscriber: ema.Feb 22 2017, 5:29 PM

Note that the JS file mentioned above loads fine adding a query argument: https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-d1cc91a7f4.js?x=1
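
One plausible reading of why the extra query argument helps, sketched as a toy cache keyed on the full URL. That the edge cache keys objects this way is an assumption, and the `ToyCache` class and `origin` function are hypothetical names used only for illustration:

```python
class ToyCache:
    """Toy cache keyed on the full URL, query string included (assumption)."""

    def __init__(self):
        self.store = {}

    def fetch(self, url, origin):
        # Serve the stored response when this exact URL was seen before.
        if url in self.store:
            return self.store[url], "hit"
        response = origin(url)
        self.store[url] = response
        return response, "miss"


JS_URL = "https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-d1cc91a7f4.js"

cache = ToyCache()
# Simulate the stale object: the redirect was cached before the file existed.
cache.store[JS_URL] = 301


def origin(url):
    # Hypothetical origin that now has the file on disk.
    return 200


stale = cache.fetch(JS_URL, origin)
fresh = cache.fetch(JS_URL + "?x=1", origin)
```

Under that assumption the bare URL keeps returning the cached redirect, while the URL with `?x=1` is a different cache object and goes to the origin.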

ema added a comment.Feb 22 2017, 5:37 PM

And the redirect comes from mediawiki:

< HTTP/1.1 301 Moved Permanently
< Date: Wed, 22 Feb 2017 17:33:13 GMT
< Content-Type: text/html; charset=iso-8859-1
< Content-Length: 242
< Connection: keep-alive
< Server: mw1249.eqiad.wmnet
< Location: https://en.wikipedia.org/w/404.php
< Vary: X-Forwarded-Proto, Accept-Encoding
< X-Varnish: 389668638, 253815670 944004746, 20563401 21955905
< Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
< Age: 80826
< X-Cache: cp1054 miss, cp3030 hit/7, cp3031 hit/637
< X-Cache-Status: hit

Is this a mediawiki core change causing this problem? Should MediaWiki-General-or-Unknown be added?

@Paladox Seems more likely it's related to the last deploy on T128546, but that's possible too.

There is a .jsl10n CSS rule with visibility: hidden. I guess JS code should show it, but it's not the case.

Oh, OK, I will add the project just in case. We can always remove it later if it turns out that it was something different.

This comment was removed by Dereckson.

[ Removed tag as not served by MediaWiki software, but by custom code in the portals repository ]

mxn added a subscriber: mxn.Feb 22 2017, 5:55 PM

Change 339227 had a related patch set uploaded (by Dereckson):
Rollback www.wikipedia.org portal code

https://gerrit.wikimedia.org/r/339227

Change 339227 merged by jenkins-bot:
Rollback www.wikipedia.org portal code

https://gerrit.wikimedia.org/r/339227

Mentioned in SAL (#wikimedia-operations) [2017-02-22T18:17:50Z] <Dereckson> Last two deployment entries were to rollback portals/ to last known state (T158782)

Dereckson added a comment.EditedFeb 22 2017, 6:21 PM

The rollback worked on mwdebug1002 (it works on mwdebug1001 too), but doesn't work in prod.

Update: It works, but the JS 404 is cached in the local browser; use a hard refresh (Ctrl + Shift + R in Chrome, for example) to force a reload.

Dereckson lowered the priority of this task from Unbreak Now! to Normal.Feb 22 2017, 6:26 PM
Dereckson raised the priority of this task from Normal to High.
debt added a subscriber: debt.Feb 22 2017, 6:29 PM

Thanks for your quick help, @Dereckson !

greg added a comment.Feb 22 2017, 6:42 PM

(And thanks, @Dereckson :) )

For reference, this was originally reported here, before I made this Phab task: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#www.wikipedia.org_page_looks_weird (15:49, 22 February 2017 (UTC)).

demon added a subscriber: demon.Feb 23 2017, 1:26 AM

This can be resolved now, right? Issue fixed, report filed. Have tasks been filed for the actionables?

I'm not sure why T158810 or T158808 would be needed to have prevented this incident.

  • How was it possible for the future url to be used (and thus cached) before it existed? Presumably because the files were synced in the wrong order?
  • How come the 404 error was cached? Our Varnish configuration very explicitly does not cache 4xx and 5xx errors for longer than 5 minutes. Worst case scenario, this deployment should've fixed itself automatically within 5 minutes. I suspect the reason is that it is a redirect, and that is very bad indeed.
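
To make the second question concrete, here is a minimal sketch of a TTL policy that caps errors but not redirects. It is not the actual Varnish configuration: `effective_ttl` and `ERROR_TTL_CAP` are hypothetical names, and 86400 is the `s-maxage` value from the portal's `Cache-Control` header quoted elsewhere in this task.

```python
ERROR_TTL_CAP = 5 * 60  # seconds; the 5-minute cap for 4xx/5xx described above


def effective_ttl(status, s_maxage):
    """Return how long a toy cache would keep a response, in seconds.

    Assumption: only 4xx and 5xx responses are capped; everything else,
    including 3xx redirects, keeps the TTL from its Cache-Control header.
    """
    if 400 <= status <= 599:
        return min(s_maxage, ERROR_TTL_CAP)
    return s_maxage


# A plain 404 would expire quickly, but a 301 pointing at a 404 page would not.
ttl_404 = effective_ttl(404, 86400)
ttl_301 = effective_ttl(301, 86400)
```

With such a policy, a missing file served as a real 404 would heal itself within 5 minutes, whereas the same miss served as a 301 could stay cached for a full day.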

The portal page is trying to load JavaScript from https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-d1cc91a7f4.js, which is redirecting to a 404 error page right now.

Why would it redirect? This is a problem since HTTP 3xx codes are not errors.

< HTTP/1.1 301 Moved Permanently
< Server: mw1249.eqiad.wmnet
< Location: https://en.wikipedia.org/w/404.php
< Age: 80826
< X-Cache-Status: hit

And the redirect comes from mediawiki

Not from mediawiki, really. Just from a text apache, which is where the portal is hosted as well. The redirect looks like the result of some bad Apache configuration that should be doing a rewrite instead of a redirect. And in any event, definitely not a permanent 301 redirect.

That is the real culprit.

demon added a comment.Feb 23 2017, 6:06 PM

I'm not sure why T158810 or T158808 would be needed to have prevented this incident.

  • How was it possible for the future url to be used (and thus cached) before it existed? Presumably because the files were synced in the wrong order?

This, as I said repeatedly on IRC yesterday, is the biggest problem here. The fact that we've got code that relies on correct sync order is suspect and needs to be thought about--yes, I'm looking at that bastard sync-portals script.

You're right that T158808 is probably a bad idea and wouldn't have helped at all. T158810 may not have prevented it, but it certainly wouldn't have hurt and I would sleep better knowing that we've removed some of the crazy here.

Not from mediawiki, really. Just from a text apache, which is where the portal is hosted as well. The redirect looks like the result of some bad Apache configuration that should be doing a rewrite instead of a redirect. And in any event, definitely not a permanent 301 redirect.

That is the real culprit.

Ahhhhh, that makes way more sense.

This problem has happened again. I see no text. I did a force refresh in chrome and still nothing.

Paladox raised the priority of this task from High to Unbreak Now!.Feb 23 2017, 6:43 PM
greg added a comment.Feb 23 2017, 6:47 PM

@Jdrewniak was there another deploy? When did that happen? I don't see anything in the SAL

Dereckson lowered the priority of this task from Unbreak Now! to High.EditedFeb 23 2017, 6:47 PM

Could be a cache issue: I've sent a purge request to the www.wikipedia.org URL, and it works again.

Terbium
$ echo https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-4398b00936.js | mwscript purgeList.php
Purging 1 urls
Done!

Gehel added a comment.Feb 23 2017, 6:51 PM

For reference, the Apache configuration backing the portals seems to be https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/templates/apache/sites/wwwportals.conf.erb . I'll dig into it...

Works now after refresh.

For reference, the Apache configuration backing the portals seems to be https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/templates/apache/sites/wwwportals.conf.erb . I'll dig into it...

Relevant part seems:

RewriteRule ^/portal/(.*)$ /<%= @portal_dir %>/$1 [L]
Header set Cache-Control "s-maxage=86400, max-age=86400"

greg added a comment.Feb 23 2017, 7:16 PM

how often is this portal being updated and what is the process? I can't see it in the SAL and I would like to see it there.

@greg there were no updates to the portal today. This brief error today must have happened because the portal was rolled back yesterday, but the HTML page remained cached on our end for 24h (we didn't purge that URL). In general the updates are done during SWAT, and should be visible in SAL (not sure why it says the last deploy was last week though).

MaxSem added a subscriber: MaxSem.Feb 23 2017, 7:42 PM

how often is this portal being updated and what is the process? I can't see it in the SAL and I would like to see it there.

It's there, actually:

14:13 hashar@tin: Synchronized portals: (no justification provided) (duration: 00m 41s)
14:12 hashar@tin: Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 40s)

The process is to update submodule and then use sync-portals to deploy it in two steps and then purge / to avoid problems like this bug. Which makes me wonder what the hell happened in this case as SAL indicates clearly that the correct process was used.
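
As an illustration of why the order of the two steps matters, here is a sketch of a pre-flight check that could run before the page itself is synced. This is not part of sync-portals: `missing_assets` is a hypothetical helper, and the HTML snippet is made up around the asset path seen in this task.

```python
import re


def missing_assets(html, deployed_paths):
    """List script/stylesheet URLs referenced by html that are not deployed yet."""
    referenced = re.findall(r'(?:src|href)="([^"]+\.(?:js|css))"', html)
    return [path for path in referenced if path not in deployed_paths]


# Hypothetical page content referencing a fingerprinted asset.
html = '<script src="portal/wikipedia.org/assets/js/index-d1cc91a7f4.js"></script>'

# Assets not synced yet: the page would point at a file that does not exist.
before_assets_sync = missing_assets(html, set())
# Assets synced: it is now safe to sync the page itself and purge /.
after_assets_sync = missing_assets(
    html, {"portal/wikipedia.org/assets/js/index-d1cc91a7f4.js"}
)
```

If the page became visible while the first list was still non-empty, a request for the new asset URL could be answered (and cached) before the file exists.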

The issue seems to be with:

RewriteRule ^/(upload|wiki|stats|w)/(.*)$ %{ENV:RW_PROTO}://en.wikipedia.<%= @domain_suffix %>/$1/$2 [R=301,L]

which rewrites the ErrorDocument (/w/404.php) into a cross host 301 redirect. Adding a RewriteCond to exclude /w/404.php might work, but feels ugly. I need a bit more thinking...
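
The match can be reproduced outside Apache with the same pattern. A minimal sketch, assuming the production host in place of the template variables; the `exclude_error_document` flag stands in for the RewriteCond idea above and is not necessarily what the eventual patch does:

```python
import re

# Pattern copied from the RewriteRule above; the target host is hard-coded
# instead of using the template variables (an assumption for illustration).
REDIRECT_RULE = re.compile(r"^/(upload|wiki|stats|w)/(.*)$")
ERROR_DOCUMENT = "/w/404.php"


def apply_rule(path, exclude_error_document=False):
    """Return the 301 target for path, or None when the rule does not fire."""
    if exclude_error_document and path == ERROR_DOCUMENT:
        # Emulates a hypothetical RewriteCond guarding the rule.
        return None
    match = REDIRECT_RULE.match(path)
    if match is None:
        return None
    return "https://en.wikipedia.org/{}/{}".format(match.group(1), match.group(2))


without_guard = apply_rule(ERROR_DOCUMENT)
with_guard = apply_rule(ERROR_DOCUMENT, exclude_error_document=True)
still_redirected = apply_rule("/wiki/Foo", exclude_error_document=True)
```

Without the guard the error document path itself becomes a cross-host redirect target; with it, ordinary `/wiki/...` paths are still redirected as before.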

Change 339657 had a related patch set uploaded (by Gehel):
portals: do not rewrite 404 errors

https://gerrit.wikimedia.org/r/339657

For reference, the Apache configuration backing the portals seems to be https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/templates/apache/sites/wwwportals.conf.erb . I'll dig into it...

It seems the /w/404.php redirect only happens for /portal urls, though. Not other urls.

https://www.wikipedia.org/foo -> https://en.wikipedia.org/foo
https://www.wikipedia.org/portalx -> https://en.wikipedia.org/portalx
# Redirect to 404 Not Found, normalised by the catch all redirect to enwiki, but kept as original path, not shown as /w/404.php

https://www.wikipedia.org/portal/x -> https://en.wikipedia.org/w/404.php
# Redirect to enwiki path change to /w/404.php

However, I agree we should change all 404s to not become redirects on the www domain. It's a general issue not specific to the portal sub directory.

As I understand the situation (helped by enabling rewrite logging on deployment-prep):

https://www.wikipedia.org/foo -> https://en.wikipedia.org/foo

The catchall rewrite is the only one triggered (RewriteRule ^(.*)$ %{ENV:RW_PROTO}://en.wikipedia.<%= @domain_suffix %>$1 [R=301,L]), the 301 redirect to en.wikipedia.org happens.
/foo does not exist on en.wikipedia.org, a 404 is triggered as it should.
So we get a 301 and then a 404, which is as expected.

https://www.wikipedia.org/portal/x -> https://en.wikipedia.org/w/404.php
The RewriteRule ^/portal/.*$ - [L] is triggered, the L flag stops further processing, we look for the file on disk and fail.
ErrorDocument 404 /w/404.php is triggered and itself triggers RewriteRule ^/(upload|wiki|stats|w)/(.*)$ %{ENV:RW_PROTO}://en.wikipedia.<%= @domain_suffix %>/$1/$2 [R=301,L] which transforms this 404 into a 301. We get the correct page, but with a 301 -> 200 instead of directly with a 404.
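
The two cases above can be walked through with a small simulation. This is a sketch of my understanding, not of Apache internals: the rule order, the hard-coded host, and the `rewrite` and `handle` functions are assumptions made for illustration.

```python
import re

PASSTHROUGH = re.compile(r"^/portal/.*$")  # RewriteRule ^/portal/.*$ - [L]
EN_PATHS = re.compile(r"^/(upload|wiki|stats|w)/(.*)$")
ERROR_DOCUMENT = "/w/404.php"
EN_HOST = "https://en.wikipedia.org"


def rewrite(path):
    """One rewrite pass: return ('serve', path) or ('redirect', url)."""
    if PASSTHROUGH.match(path):
        return ("serve", path)
    if EN_PATHS.match(path):
        return ("redirect", EN_HOST + path)
    # Catch-all rule: everything else is redirected to enwiki.
    return ("redirect", EN_HOST + path)


def handle(path, files_on_disk):
    """Return (status, location) as seen by the client."""
    action, target = rewrite(path)
    if action == "redirect":
        return (301, target)
    if target in files_on_disk:
        return (200, None)
    # File missing: the ErrorDocument path goes through the rules again.
    action, target = rewrite(ERROR_DOCUMENT)
    if action == "redirect":
        return (301, target)
    return (404, None)


foo = handle("/foo", set())
portalx = handle("/portalx", set())
portal_missing = handle("/portal/x", set())
portal_present = handle("/portal/x", {"/portal/x"})
```

In this model only the missing file under /portal/ ends up at /w/404.php through a 301, which matches the behaviour observed above.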

After some testing on deployment-prep, I believe that https://gerrit.wikimedia.org/r/#/c/339657/ is fixing the issue correctly. But I'm not entirely comfortable with Apache configuration... review welcomed.

https://www.wikipedia.org/portal/x -> https://en.wikipedia.org/w/404.php
The RewriteRule ^/portal/.*$ - [L] is triggered, the L flag stops further processing, we look for the file on disk and fail.
ErrorDocument 404 /w/404.php is triggered and itself triggers [..].

What is the purpose of that rewrite rule?

As I understand the situation (helped by enabling rewrite logging on deployment-prep):

Is deployment-prep Apache config like Production Apache config though?

Gehel added a comment.Feb 28 2017, 8:51 AM

https://www.wikipedia.org/portal/x -> https://en.wikipedia.org/w/404.php
The RewriteRule ^/portal/.*$ - [L] is triggered, the L flag stops further processing, we look for the file on disk and fail.
ErrorDocument 404 /w/404.php is triggered and itself triggers [..].

What is the purpose of that rewrite rule?

As I understand it, it is only there to stop the processing and not trigger the following rules, in particular the catchall:
RewriteRule ^(.*)$ %{ENV:RW_PROTO}://en.wikipedia.org$1 [R=301,L]

Gehel added a comment.Feb 28 2017, 8:56 AM

As I understand the situation (helped by enabling rewrite logging on deployment-prep):

Is deployment-prep Apache config like Production Apache config though?

The same template is used on deployment-prep and production. The differences are the domain suffix (www.wikipedia.beta.wmflabs.org vs www.wikipedia.org) and one additional rewrite rule (RewriteRule ^/portal/(.*)$ /portal-master/$1 [L]) that seems to be used as a promotion mechanism.

Gehel added a comment.EditedFeb 28 2017, 10:21 AM

The ErrorDocument directive being triggered is the one in /etc/apache2/sites-available/01-main.conf (tested by changing this directive to another file and watching the logs on deployment-prep).

As I understand it (but again, I'm limited here), the ErrorDocument isn't a redirect, but something similar to an internal rewrite. As such it isn't possible to flag it as [L] (or at least, I have not found how). The RewriteRules apply first, end up in a 404 which is handled by the ErrorDocument directive, and this then goes through another round of rewrites to find the actual error document (again, this is my limited understanding, gained by reading some docs and poking around deployment-prep logs).

I think that it makes sense to keep the ErrorDocument definition in 01-main.conf, but I'll add a comment in the portal config file to make it more explicit.

Side note: I'm attempting a cleanup of some of the duplication in the erb template behind the portal configuration (https://gerrit.wikimedia.org/r/#/c/340132/), but I think that this cleanup should stay separate from this issue.

Gehel added a comment.Mar 1 2017, 11:09 AM

Deployment plan (some additional docs on wikitech):

  1. disable puppet on all mw servers (salt --batch-size=25% 'mw*' cmd.run "puppet agent --disable 'deploying apache config - T158782'")
  2. merge change
  3. enable puppet and run agent on mwdebug1001
  4. test (see below)
  5. enable and run puppet on one mw server (puppet agent -t --tags mw-apache-config)
  6. check apache config (apache2ctl configtest)
  7. test (see below)
  8. re-enable puppet on all mw servers (salt -v -t 10 -b 100 -G cluster:appserver cmd.run "puppet agent --enable")
  9. run puppet on all apps servers (salt --batch-size=25% 'mw*' cmd.run 'puppet agent -t --tags mw-apache-config')
  10. test

Testing:
on tin:

apache-fast-test ~oblivian/baseurls mwdebug1001.eqiad.wmnet
apache-fast-test ~gehel/check_urls_339657 mwdebug1001.eqiad.wmnet

before the patch the following answers are expected:

after the patch:

Mentioned in SAL (#wikimedia-operations) [2017-03-01T14:54:45Z] <gehel> starting deployment of mediawiki apache config - T158782

Change 339657 merged by Gehel:
portals: do not rewrite 404 errors

https://gerrit.wikimedia.org/r/339657

Mentioned in SAL (#wikimedia-operations) [2017-03-01T15:05:41Z] <gehel> mwdebug1001 looks good, deploying on mw1209 - T158782

Mentioned in SAL (#wikimedia-operations) [2017-03-01T15:10:39Z] <gehel> mw1209 looks good, deploying on codfw - T158782

Mentioned in SAL (#wikimedia-operations) [2017-03-01T15:18:51Z] <gehel> testing a few host on codfw looks good, deploying on eqiad - T158782

Mentioned in SAL (#wikimedia-operations) [2017-03-01T15:28:33Z] <gehel> deploying on eqiad completed - T158782

debt moved this task from In Progress to Done on the Discovery-Portal-Sprint board.Mar 3 2017, 3:50 PM

Moving this to done as it seems like everything is finished with this fix. Please comment/reopen if there is more to do.

debt closed this task as Resolved.Mar 3 2017, 4:08 PM
debt claimed this task.
debt reopened this task as Open.Mar 6 2017, 6:27 PM
debt moved this task from Done to Needs code review on the Discovery-Portal-Sprint board.
debt removed debt as the assignee of this task.

This might be causing issues - as noted recently in T153764#3076438, let's get it checked out, please.

Dzahn removed a subscriber: Dzahn.Mar 6 2017, 6:29 PM
Krinkle removed a subscriber: Krinkle.Mar 6 2017, 7:38 PM

With the changes to the server config and the full list of URLs in the urls-to-purge.txt, I think we can mark this issue as done.

debt closed this task as Resolved.Mar 15 2017, 7:33 PM