
enwiki Main_Page timeouts
Closed, ResolvedPublic

Description

Periodically (a few times a day), the Apache response when rendering the enwiki Main_Page times out. This causes pybal to fail health checks against that URL and attempt to depool multiple servers. pybal monitoring of the upstream nginx/Varnish layers also times out on the same URL at the same time, which indicates that a cache invalidation of Main_Page is happening around the same time.

The suspicion at this point is that this is related to Main_Page content updates; needs further investigation...
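(For context, the check that fails here is essentially a timed HTTP fetch of the Main_Page render URL. Below is a minimal sketch of that kind of probe; the URL and timeout are placeholders for illustration, not the actual pybal configuration.)

import time
import urllib.request

URL = "http://localhost/wiki/Main_Page"   # hypothetical backend URL, not the real check target
TIMEOUT = 5                               # seconds; assumed value, the real timeout may differ

def check_once():
    """Fetch the page once and return (elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
            resp.read()
        return time.monotonic() - start, None
    except Exception as exc:              # socket timeout, HTTP error, etc.
        return time.monotonic() - start, exc

if __name__ == "__main__":
    elapsed, error = check_once()
    print(f"FAIL after {elapsed:.2f}s: {error}" if error else f"OK in {elapsed:.2f}s")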

Event Timeline

BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description. (Show Details)
BBlack added projects: acl*sre-team, Traffic.
BBlack added subscribers: BBlack, Joe, faidon, chasemp.
BBlack triaged this task as High priority. Jun 29 2015, 6:54 PM
BBlack moved this task from Triage to In Progress on the Traffic board.

T102916 is likely a side effect of excess purges.

Anybody planning to investigate this?

@BBlack wrote:

Periodically (a few times a day), apache response on rendering the enwiki Main_Page times out.

7 weeks later: Is this still happening? Anybody planning to investigate this?

ori added a project: Performance-Team.
ori set Security to None.

The logs corroborate bblack's suspicion:

fluorine:/a/mw-log/slow-parse.log:

2015-08-21 16:33:37 mw2077 enwiki slow-parse INFO: 5.76  Main_Page {"private":false}

lvs1003:/var/log/pybal.log:

2015-08-21 16:35:59.121311 [apaches_80] Could not depool server mw1249.eqiad.wmnet because of too many down!
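(A rough way to correlate the two logs above by timestamp, assuming the line formats shown in these excerpts; the file paths and the five-minute window are illustrative only.)

import re
from datetime import datetime, timedelta

SLOW_PARSE_LOG = "slow-parse.log"   # assumed local copy of the log quoted above
PYBAL_LOG = "pybal.log"             # assumed local copy of the log quoted above
WINDOW = timedelta(minutes=5)       # assumed correlation window

TS_FMT = "%Y-%m-%d %H:%M:%S"
SLOW_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*slow-parse.*Main_Page")
PYBAL_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*Could not depool")

def events(path, pattern):
    """Return (timestamp, line) pairs for lines matching the given pattern."""
    out = []
    with open(path) as fh:
        for line in fh:
            m = pattern.match(line)
            if m:
                out.append((datetime.strptime(m.group(1), TS_FMT), line.rstrip()))
    return out

slow = events(SLOW_PARSE_LOG, SLOW_RE)
for ts, depool_line in events(PYBAL_LOG, PYBAL_RE):
    nearby = [l for t, l in slow if abs(t - ts) <= WINDOW]
    if nearby:
        print(depool_line)
        for l in nearby:
            print("    " + l)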

For PyBal, we switched monitoring to Special:BlankPage, so this no longer depools servers. That means the problem now has even less visibility from an ops perspective, but it is still a user-facing issue.

@ori: Any news? Or should this have lower priority?


@Aklapper, no, it's still serious. I'm just swamped; sorry.

[22:18 UTC] krinkle at fluorine.eqiad.wmnet in /a/mw-log
$ grep Main_Page slow-parse.log
2017-03-01 15:04:16 mw2108 enwiki slow-parse 3.35 Main_Page
2017-03-01 17:11:59 mw2151 enwiki slow-parse 3.02 Main_Page
2017-03-01 17:23:28 mw2113 enwiki slow-parse 3.03 Main_Page
2017-03-01 17:52:04 mw2242 betawikiversity slow-parse 4.77 Main_Page
2017-03-01 17:52:05 mw2104 specieswiki slow-parse 5.17 Main_Page
2017-03-01 17:52:08 mw2199 kiwiki slow-parse 4.35 Main_Page
2017-03-01 17:52:09 mw2194 rnwiki slow-parse 4.52 Main_Page
2017-03-01 17:52:10 mw2166 wawiki slow-parse 3.40 Editable_Main_Page
2017-03-01 17:52:11 mw2227 twwiki slow-parse 9.20 Main_Page
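(A quick way to see whether these cluster together, as in the 17:52 burst above, is to bucket the grep output by minute. A rough sketch, assuming each line starts with a "YYYY-MM-DD HH:MM:SS" timestamp as in the excerpt; the script name in the usage line is hypothetical.)

import sys
from collections import Counter

# Reads grep output like the above on stdin and counts slow Main_Page
# parses per minute.
buckets = Counter()
for line in sys.stdin:
    if "Main_Page" in line and len(line) >= 16:
        buckets[line[:16]] += 1   # truncate to "YYYY-MM-DD HH:MM"

for minute, count in sorted(buckets.items()):
    print(f"{minute}  {count} slow parse(s)")

Usage would be something like: grep Main_Page slow-parse.log | python3 bucket_slow_parses.py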

If this turns out not to be specific to the Main Page, and is instead a general issue with MediaWiki/Apache requests sometimes taking over a second due to some kind of stall somewhere in the stack, it could be related to T164248: HTTP responses from app servers sometimes stall for >1s.

Closing. Contrary to my previous comment, at least from tailing Varnish for a while and from looking at slow-parse.log, we no longer have any timeouts. We do still have Main_Page parses that take between 3 and 10 seconds, but that is not causing a timeout.

(And this isn't an issue of complex templates causing the parser to take more than a second to parse a page; that has gotten better with HHVM, with typical outliers now at 7-10s instead of 30+s.)

Random app server requests taking longer is tracked further at T164248.