Page MenuHomePhabricator

Investigate 2016-09-08 increase in firstPaint for p95 & p99
Closed, DuplicatePublic

Description

It all started something like 2016-09-08 19:XX

p95 from 4 s to over 10 s, p99 increased from 14 s to over a minute.

Event Timeline

Peter created this task.Sep 12 2016, 6:04 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 12 2016, 6:04 AM
Peter added a comment.Sep 12 2016, 6:24 AM

It's all in desktop. Couldn't see anything special in hit ratio for nav timing. And no special increase/decrease for how many metrics from Chrome.

On 2016-09-07 Chrome 53 was released for iOS & Android. But desktop was one week earlier: 2016-08-31. And checking per version, it all happens in 52:

Peter added a comment.Sep 12 2016, 6:39 AM

It looks like a little change for authenticated users:

And major change for anonymous:

ori added a subscriber: ori.Sep 12 2016, 6:15 PM

@Peter, the next step in isolating an issue like this is to check the Server Admin Log to see if the regression coincided with a deployment. In this case, it looks like the following entries are possible correlates:

  • 20:03 logmsgbot: demon@tin Synchronized php-1.28.0-wmf.18/extensions/Echo/includes/SeenTime.php: Trying to stop some duplicate redis fetches (duration: 00m 52s)
  • 19:57 bblack: repooling normal traffic to cache_upload in ulsfo
  • 19:02 logmsgbot: demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.18
ori added a comment.Sep 12 2016, 6:22 PM

It looks like it started at around 19:07, which lines up neatly with the group2 to wmf.18 deployment at 19:02 + 5 minutes of ResourceLoader cache time.

Mentioned in SAL [2016-09-12T18:24:51Z] <ori> Changing wikiversion for group2 wikis on mw1017 to debug regression (T145359)

Looking for JavaScript code that has changed from wmf.17 to wmf.18 and that loads for anonymous users on most or all pages, I see changes in ULS and CentralNotice. @Nikerabbit, @AndyRussG, are you aware of any changes that could have caused a regression?

@ori there was a CN update that went out in a SWAT deploy, 23:00 - 0:00 UTC. Nothing earlier that day...

Krinkle renamed this task from Investigate increase in firstPaint for p95 & p99 to Investigate 2016-09-08 increase in firstPaint for p95 & p99.Sep 12 2016, 8:57 PM
Krinkle added a subscriber: Krinkle.

That ULS update contains big jquery.uls update, but nothing in there calls my attention, as all of that code is supposed to be executed only after user interaction.

Nikerabbit moved this task from Backlog to Watching on the User-Nikerabbit board.Sep 13 2016, 6:19 AM