Page MenuHomePhabricator

Beta cluster seems to be extremely slow for logged in user during page navigation
Open, MediumPublic

Description

Steps to reproduce:

  1. Go to this page as a logged in user: https://en.wikipedia.beta.wmflabs.org/wiki/User:RYasmeen_(WMF)/sandbox
  2. Click on the tab "View History".
  3. Check any diff.
  4. Navigate to some other page.

All of these are taking a long time to complete.

Event Timeline

Ryasmeen created this task.Nov 6 2020, 6:57 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 6 2020, 6:57 PM
Ryasmeen updated the task description. (Show Details)Nov 6 2020, 6:58 PM
Ryasmeen updated the task description. (Show Details)

Hrm, I can recreate this.

The MediaWIki host seems to be running just fine afaict, but it takes some time for the deployment-cache-text06 host to relay traffic over to deployment-mediawiki-07: something's wrong on that host.

It's been a while since I've looked at the cache frontends...

Now I would imagine it's *very* slow since I broke it: T267439

thcipriani added a subscriber: ema.

I think the problem is somewhere in the varnish layers; however, I'm conceding defeat there since I spent the last hour getting back to where I started.

@ema I noticed you'd done some work on the deployment-cache-text06 layers at some point -- would you have time to take a look at what might be causing routing slowness there?

Restricted Application added a project: SRE. · View Herald TranscriptNov 6 2020, 9:22 PM
jijiki triaged this task as Medium priority.Nov 10 2020, 4:08 PM
ema added a comment.Nov 23 2020, 12:44 PM

@ema I noticed you'd done some work on the deployment-cache-text06 layers at some point -- would you have time to take a look at what might be causing routing slowness there?

I did, but the issue does not seem related -- as long as I'm using the right origin server! Cache misses/pass traffic are served by deployment-mediawiki-07.deployment-prep.eqiad.wmflabs, is that right?

The server is very slow when responding to authenticated requests indeed, Portal:Current_events takes ~7 seconds:

ema@deployment-cache-text06:~$ time curl -v -H "Cookie: enwikimwuser-sessionId=REDACTED; ss0-enwikiSession=REDACTED; enwikiSession=REDACTED; enwikiss0-UserID=44787; enwikiUserID=44787; enwikiss0-UserName=Fotosintesi; enwikiUserName=Fotosintesi; centralauth_ss0-User=Fotosintesi; centralauth_User=Fotosintesi; ss0-centralauth_Session=REDACTED; centralauth_Session=REDACTED; loginnotify_prevlogins=2020-1k1gx82-fd0b8221tuvk5d7jsnirhx3cmuxy3az; enwiki-mw-tour=%7B%22version%22%3A1%2C%22tours%22%3A%7B%7D%7D" -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.beta.wmflabs.org"  deployment-mediawiki-07.deployment-prep.eqiad.wmflabs/wiki/Portal:Current_events
[...]
real    0m7.537s

I did, but the issue does not seem related -- as long as I'm using the right origin server! Cache misses/pass traffic are served by deployment-mediawiki-07.deployment-prep.eqiad.wmflabs, is that right?

Indeed!

The server is very slow when responding to authenticated requests indeed, Portal:Current_events takes ~7 seconds:

ema@deployment-cache-text06:~$ time curl -v -H "Cookie: enwikimwuser-sessionId=REDACTED; ss0-enwikiSession=REDACTED; enwikiSession=REDACTED; enwikiss0-UserID=44787; enwikiUserID=44787; enwikiss0-UserName=Fotosintesi; enwikiUserName=Fotosintesi; centralauth_ss0-User=Fotosintesi; centralauth_User=Fotosintesi; ss0-centralauth_Session=REDACTED; centralauth_Session=REDACTED; loginnotify_prevlogins=2020-1k1gx82-fd0b8221tuvk5d7jsnirhx3cmuxy3az; enwiki-mw-tour=%7B%22version%22%3A1%2C%22tours%22%3A%7B%7D%7D" -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.beta.wmflabs.org"  deployment-mediawiki-07.deployment-prep.eqiad.wmflabs/wiki/Portal:Current_events
[...]
real    0m7.537s

Ah! TIL: X-Forwarded-Proto -- I couldn't make beta respond to curl locally previously. All I could say is that deployment-mediawiki-07 was responding instantly to apache requests.

Did some further investigation. Special:BlankPage spends 4 seconds rendering on the appserver if there is a user session, but 0m0.129s rendering if logged out.

Will need more digging. Current suspicion: slowness in session store, slow connection to session store, session store overloaded.

Thank you for your insights @ema!

ema moved this task from Triage to Watching on the Traffic board.Nov 24 2020, 3:18 PM