
Diff page consistently produces 503 on beta cluster on first visit
Closed, Resolved · Public

Description

Visit a random page on mobile on the beta cluster (this does not impact production):
https://en.m.wikipedia.beta.wmflabs.org/wiki/Special:Random

Make an edit and save.
At the bottom of the page, click the "Last edited by" bar to get to the history page.
Click the first entry.
Expected: The diff will render
Actual: A 503 occurs

After refreshing a few times, the page loads.
This seems to be easy to replicate.
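
For anyone who would rather script the check than click through, here is a minimal sketch that polls a mobile diff URL and prints the status codes; the revision ID in the URL is a placeholder, substitute one from an edit you just made:

```php
<?php
// Minimal sketch: request a Special:MobileDiff page a few times and print
// the HTTP status codes, to capture the intermittent 503s.
// The revision ID (12345) is a placeholder.
$url = 'https://en.m.wikipedia.beta.wmflabs.org/wiki/Special:MobileDiff/12345';
for ($i = 1; $i <= 5; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true); // status code only, skip the body
    curl_exec($ch);
    echo "attempt $i: HTTP ", curl_getinfo($ch, CURLINFO_HTTP_CODE), "\n";
    curl_close($ch);
    sleep(1);
}
```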

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 25 2017, 2:29 PM

Change 380520 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Don't run diff test on beta cluster

https://gerrit.wikimedia.org/r/380520

Jdlrobson updated the task description.

Change 380520 merged by jenkins-bot:
[mediawiki/skins/MinervaNeue@master] Don't run diff test on beta cluster

https://gerrit.wikimedia.org/r/380520

greg added a subscriber: greg.

Per Dev/Maintainers, adding the Contributors team as maintainers of the diff view.

phuedx added a subscriber: phuedx. · Sep 26 2017, 2:00 PM

@Jdlrobson: Was/is this specific to the Beta Cluster? I can't reproduce this in production and I've had a hard time reproducing it on the BC.

Jdlrobson renamed this task from Diff page produces 503 on first visit to Diff page produces 503 on beta cluster on first visit. · Sep 26 2017, 2:09 PM
Jdlrobson renamed this task from Diff page produces 503 on beta cluster on first visit to Diff page consistently produces 503 on beta cluster on first visit.
Jdlrobson updated the task description.
greg added a subscriber: tstarling.
17:24 <   legoktm> greg-g: https://phabricator.wikimedia.org/T176637 could potentially be an issue with
                   the new wikidiff2 version that's on beta

And thus adding @tstarling per [[mw:developers/maintainers]]

greg added a subscriber: MaxSem. · Sep 26 2017, 7:45 PM

Adding @MaxSem and TCB-Team per Lego on IRC

Stacktrace:

jkroll added a subscriber: jkroll. · Oct 2 2017, 3:35 PM

This seems to be related to the new wikidiff2. I wrote the moved-paragraphs patch, and I have been trying to reproduce this problem for the last couple of days. It's not reproducible for me on Apache/Zend, and I can't reproduce it with my distro-provided HHVM 3.22.0 either.

Next, I tried to install mw-vagrant as per MoritzMuehlenhoff's suggestion in T176217. After working around T152801 and provisioning the box, I found that it wasn't running HHVM at all. Unsurprisingly, I couldn't reproduce the problem there, so I gave up on mw-vagrant.

Instead, I built HHVM 3.18.5 from source. I cannot reproduce the problem with that version either: I can display diffs just fine, including the mobile diff, and haven't seen any stack traces or suspicious log entries.

I've exhausted all options to reproduce this I can think of. The one thing left would be getting ssh access to the beta cluster, and figuring out what is happening there.
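
(For reference: one way to take MediaWiki out of the equation entirely is to call the extension directly under the runtime being tested, e.g. `hhvm smoke.php` vs. `php smoke.php`. A minimal sketch, assuming the three-argument wikidiff2_do_diff() signature from before the moved-paragraphs patch:

```php
<?php
// Smoke test for the wikidiff2 extension, independent of MediaWiki.
// wikidiff2_do_diff() takes the two texts and a number of context lines.
if (!function_exists('wikidiff2_do_diff')) {
    fwrite(STDERR, "wikidiff2 extension not loaded\n");
    exit(1);
}
$old = "one\ntwo\nthree\n";
$new = "one\ntwo, changed\nthree\n";
echo wikidiff2_do_diff($old, $new, 2), "\n";
```
)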

MaxSem added a comment. · Oct 2 2017, 6:01 PM

I can give you access - what's your wikitech username?

MaxSem added a comment. · Oct 3 2017, 8:21 PM

Added you to the project; you can now ssh into hosts like deployment-mediawiki04.eqiad.wmflabs, where the crashes happen. I also added you to the project admins so that you can use sudo to attach a debugger to running processes.

jkroll added a comment. · Oct 9 2017, 2:18 PM

Thanks @MaxSem. I've tried to reproduce it on deployment-mediawiki04.eqiad.wmflabs without success. According to logs, the last crash was on Sep 28.

To all: Is anybody still able to cause a crash?

jkroll added a comment. · Oct 9 2017, 2:43 PM

Also, getting a core file when this happens would be helpful. The core pattern was set to /data/project/cores/, but /data doesn't seem to be mounted. I've temporarily set it to /tmp, though I'm not sure whether the next puppet run will overwrite it.
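
(A quick sketch of the kind of check that would have caught this: read the kernel's core pattern and verify that the target directory actually exists and is writable. Paths as above; this only handles patterns that are absolute paths, not pipe handlers:

```php
<?php
// Sketch: verify that core dumps have somewhere to go. Here the pattern
// pointed into /data/project/cores/, but /data was not mounted, so any
// cores were silently lost.
$pattern = trim(file_get_contents('/proc/sys/kernel/core_pattern'));
echo "kernel.core_pattern: $pattern\n";
if ($pattern !== '' && $pattern[0] === '/') {
    $dir = dirname($pattern);
    if (!is_dir($dir) || !is_writable($dir)) {
        echo "warning: $dir is missing or not writable\n";
    }
}
```
)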

phuedx removed a subscriber: phuedx. · Oct 9 2017, 3:09 PM

I have also tried reproducing this and really can't make it 503.
Do we have a link to the original page / diff that this happened on?

> Thanks @MaxSem. I've tried to reproduce it on deployment-mediawiki04.eqiad.wmflabs without success. According to logs, the last crash was on Sep 28.
> To all: Is anybody still able to cause a crash?

On the 28th, the update to HHVM 3.18.5 was installed via unattended-upgrades. That might have fixed it (or could be entirely unrelated/coincidental).

> On the 28th, the update to HHVM 3.18.5 was installed via unattended-upgrades. That might have fixed it (or could be entirely unrelated/coincidental).

Is it possible or worth rolling back the version on 1 mediawiki host in the beta cluster to see if we can reproduce?

jkroll added a comment (edited). · Oct 9 2017, 3:34 PM

> On the 28th, the update to HHVM 3.18.5 was installed via unattended-upgrades. That might have fixed it (or could be entirely unrelated/coincidental).

Is it likely that between the rollout of the new wikidiff2 and the update, HHVM was running continuously without a restart? I.e., the new wikidiff2 binary was installed on Sep 15, but HHVM was only restarted on Sep 28.

@Addshore this doesn't appear to be happening any more, but when it did, it was 100% reproducible, so the theory about wikidiff2 (which MobileDiff uses when available) is a good one!

Jdlrobson updated the task description. · Oct 9 2017, 4:08 PM

> Is it likely that between the rollout of the new wikidiff2 and the update, HHVM was running continuously without a restart? I.e., the new wikidiff2 binary was installed on Sep 15, but HHVM was only restarted on Sep 28.

Yeah, we don't have any automatic restarts for HHVM in deployment-prep. Unless someone manually restarted it (unlikely), it was running continuously between Sep 15 and 28.

> Is it possible or worth rolling back the version on 1 mediawiki host in the beta cluster to see if we can reproduce?

Yes, we can do that tomorrow morning (Euro time).

@MoritzMuehlenhoff and I just spent some time trying to reproduce this.

We rolled back HHVM as well as the extension, then stepped through the wikidiff2 upgrade, trying to trigger 503s or log entries along the way, and we failed to reproduce anything.

The wikidiff2 patch introduced new parameters with default values to wikidiff2_do_diff() and wikidiff2_inline_diff(). A possible explanation for the crashes: HHVM caches data for performance reasons, including function signatures provided by extensions. Between the deployment of the new wikidiff2 version on the 15th and the HHVM unattended upgrade on the 28th, HHVM would have cached the old signatures in memory, while the new wikidiff2 binary provided the new signatures and the PHP code was already patched to use them, causing a mismatch.

This would be consistent with the timeframe of the crashes, and it would explain why we can't reproduce them now. The logs seem to show that the problems occurred while trying to call wikidiff2 functions, i.e. not in native code, which is also consistent with the above.

Writing a program so that parts of it can be hot-swapped while it is running takes considerable effort. Unless it's documented somewhere that HHVM can do this (which I don't know; I'm not an HHVM expert), we shouldn't assume it can. It should be restarted after a native PHP extension is upgraded.
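
(To illustrate the suspected mismatch: the old binary exported wikidiff2_do_diff() with three parameters, the new one with four. The fourth parameter's name below is illustrative, modelled on MediaWiki's $wgWikiDiff2MovedParagraphDetectionCutoff setting. A caller could defend against the ambiguity by checking the arity it actually sees at runtime:

```php
<?php
// Hypothetical defensive caller. The old wikidiff2 exported
//   wikidiff2_do_diff( $text1, $text2, $numContextLines )
// while the new one adds a fourth parameter (illustratively, a
// moved-paragraph detection cutoff). If the engine has cached the old
// signature while the PHP side passes the new argument, the call no
// longer matches what the engine believes the function accepts.
$old = "a\nb\nc\n";
$new = "a\nB\nc\n";
$fn = new ReflectionFunction('wikidiff2_do_diff');
if ($fn->getNumberOfParameters() >= 4) {
    $diff = wikidiff2_do_diff($old, $new, 2, 0); // 0 used as a neutral default
} else {
    $diff = wikidiff2_do_diff($old, $new, 2);
}
echo $diff, "\n";
```
)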

> It should be restarted after a native PHP extension is upgraded.

HHVM was restarted after installing the new package (otherwise it would not have loaded the revised modules anyway). The cache is persistent on disk, though; we can prune it manually.

> > It should be restarted after a native PHP extension is upgraded.
>
> HHVM was restarted after installing the new package (otherwise it would not have loaded the revised modules anyway). The cache is persistent on disk, though; we can prune it manually.

That was a misunderstanding, then; I thought you said it wasn't. The disk cache is another likely explanation, though. Is there documentation on where it lives and what exactly is cached?

So.

I think we might be okay to close this ticket now?

Jdforrester-WMF closed this task as Resolved. · Oct 10 2017, 10:36 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Provisionally marking as Resolved.

> That was a misunderstanding, then; I thought you said it wasn't. The disk cache is another likely explanation, though. Is there documentation on where it lives and what exactly is cached?

Documentation is sparse (which applies to HHVM in many respects), but some docs are here: http://hhvm.com/blog/4061/go-faster
That also supports what happened here: when we moved from 3.18.2 to 3.18.5, the HHVM build ID changed and the bytecode cache was discarded.
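
(If this comes up again, a quick sanity check from a web request or `hhvm -r` is to print the runtime's version and the bytecode repo location. HHVM_VERSION is defined by HHVM itself; the ini key below is an assumption mirroring HHVM's Repo.Central.Path setting and may vary between versions:

```php
<?php
// Sketch: confirm which HHVM build is serving requests and where its
// bytecode repo lives.
if (defined('HHVM_VERSION')) {
    echo 'HHVM ', HHVM_VERSION, "\n";
    echo 'bytecode repo: ', ini_get('hhvm.repo.central.path'), "\n";
} else {
    echo "not running under HHVM\n";
}
```
)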

Tobi_WMDE_SW moved this task from Proposed to Done on the WMDE-QWERTY-Team board. · Oct 17 2017, 1:09 PM
Tobi_WMDE_SW moved this task from Done to Demoed on the WMDE-QWERTY-Team board. · Oct 17 2017, 2:17 PM