enwiki Main_Page timeouts
Closed, Resolved · Public

Description

Periodically (a few times a day), the Apache response for rendering the enwiki Main_Page times out. This causes pybal healthchecks against that URL to fail and pybal to attempt to depool multiple servers. pybal monitoring of the upstream nginx/varnish layer also times out on the same URL at the same time, which indicates that a cache invalidation of Main_Page is happening around then (otherwise the caches would serve the page without hitting the backends).

The suspicion at this point is that this is related to Main_Page content updates; it needs further investigation.
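
One way to check this by hand, as a minimal sketch only (the appserver hostname, Host header and URL path below are illustrative assumptions, not the actual pybal healthcheck):

# Time a direct Main_Page render against a single appserver once a minute,
# then compare any spikes against Main_Page edit/purge times.
while true; do
  date -u '+%Y-%m-%d %H:%M:%S'
  curl -s -o /dev/null -w '%{time_total}s %{http_code}\n' \
    -H 'Host: en.wikipedia.org' 'http://mw1249.eqiad.wmnet/wiki/Main_Page'
  sleep 60
done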

Event Timeline

BBlack created this task.Jun 29 2015, 6:17 PM
BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description.
BBlack added projects: acl*sre-team, Traffic.
BBlack added subscribers: BBlack, Joe, faidon, chasemp.
Restricted Application added subscribers: Matanya, Aklapper.Jun 29 2015, 6:17 PM
BBlack triaged this task as High priority.Jun 29 2015, 6:54 PM
BBlack moved this task from Triage to In Progress on the Traffic board.
aaron added a subscriber: aaron.Jun 29 2015, 7:36 PM

T102916 is likely a side effect of excess purges.

Anybody planning to investigate this?

mark added a subscriber: mark.
@BBlack wrote:

Periodically (a few times a day), apache response on rendering the enwiki Main_Page times out.

7 weeks later: Is this still happening? Anybody planning to investigate this?

ori claimed this task.Aug 21 2015, 9:27 PM
ori added a project: Performance-Team.
ori set Security to None.
ori added a comment.Aug 21 2015, 9:30 PM

The logs corroborate bblack's suspicion:

fluorine:/a/mw-log/slow-parse.log:

2015-08-21 16:33:37 mw2077 enwiki slow-parse INFO: 5.76  Main_Page {"private":false}

lvs1003:/var/log/pybal.log:

2015-08-21 16:35:59.121311 [apaches_80] Could not depool server mw1249.eqiad.wmnet because of too many down!
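
The slow parse at 16:33 and the failed depool at 16:35 line up within a couple of minutes, which is the pattern to look for; a quick way to pull the matching lines out of both logs (paths as above, grep patterns are just examples):

$ grep 'Main_Page' /a/mw-log/slow-parse.log     # on fluorine
$ grep 'too many down' /var/log/pybal.log       # on lvs1003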
greg added a subscriber: greg.Aug 21 2015, 9:41 PM

For PyBal, we switched monitoring to Special:BlankPage, so this no longer depools servers. That means the problem now has even less visibility from the ops perspective, but it is still a user-facing issue.

@ori: Any news? Or should this have lower priority?

ori added a comment.Nov 25 2015, 5:27 PM

@Aklapper, no, it's still serious. I'm just swamped; sorry.

BBlack moved this task from In Progress to Up Next on the Traffic board.Nov 30 2015, 6:03 PM
BBlack moved this task from Up Next to Triage on the Traffic board.Apr 4 2016, 12:38 PM
BBlack moved this task from Triage to Watching on the Traffic board.Oct 4 2016, 12:46 PM
Krinkle added a subscriber: Krinkle. Edited Mar 1 2017, 10:22 PM
[22:18 UTC] krinkle at fluorine.eqiad.wmnet in /a/mw-log
$ grep Main_Page slow-parse.log
2017-03-01 15:04:16 mw2108 enwiki slow-parse 3.35 Main_Page
2017-03-01 17:11:59 mw2151 enwiki slow-parse 3.02 Main_Page
2017-03-01 17:23:28 mw2113 enwiki slow-parse 3.03 Main_Page
2017-03-01 17:52:04 mw2242 betawikiversity slow-parse 4.77 Main_Page
2017-03-01 17:52:05 mw2104 specieswiki slow-parse 5.17 Main_Page
2017-03-01 17:52:08 mw2199 kiwiki slow-parse 4.35 Main_Page
2017-03-01 17:52:09 mw2194 rnwiki slow-parse 4.52 Main_Page
2017-03-01 17:52:10 mw2166 wawiki slow-parse 3.40 Editable_Main_Page
2017-03-01 17:52:11 mw2227 twwiki slow-parse 9.20 Main_Page

If this turns out not to be specific to the Main Page, but rather a general issue with MediaWiki/Apache requests sometimes taking over a second due to some kind of stall somewhere in the stack, it could be related to T164248: HTTP responses from app servers sometimes stall for >1s.
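
A rough way to tell a slow parse apart from a mid-response stall, as a sketch against the public URL rather than an individual appserver: compare time-to-first-byte with total time. A slow parse delays the first byte, whereas a stall while the response is being sent shows up as a gap between the two numbers.

$ curl -s -o /dev/null \
    -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
    'https://en.wikipedia.org/wiki/Main_Page'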

Krinkle closed this task as Resolved.Jul 22 2017, 2:12 AM

Closing. Contrary to my previous comment, at least from tailing Varnish for a while and from looking at slow-parse.log, we no longer see any timeouts. We do still have Main_Page parses that take between 3 and 10 seconds, but those are not causing timeouts.

(And this isn't an issue of complex templates causing the parser to take more than a second to parse a page; that has gotten better with HHVM, and typical outliers are now 7-10s instead of 30s+.)
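
For reference, parse cost can also be spot-checked without access to the log host: MediaWiki appends a "NewPP limit report" HTML comment to rendered pages, which includes real and CPU time usage (exact fields vary by MediaWiki version):

$ curl -s 'https://en.wikipedia.org/wiki/Main_Page' | grep -A 10 'NewPP limit report'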

Random app server requests taking longer are tracked further in T164248.