
2019-02-21 loadEventEnd overall + responseStart mobile regression
Closed, Resolved · Public

Description

The spike that starts around 2019-02-21 03:00 is due to eqsin being taken out of rotation, with users based in Asia being directed to ulsfo instead: https://grafana.wikimedia.org/d/000000304/prometheus-varnish-dc-stats?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend&var-layer=frontend&from=1550697339623&to=1550805339623

However, things went back to normal from a networking perspective around 2019-02-21 20:00.

This networking outage seems to have masked a likely MediaWiki regression that happened during that timespan, when group2 graduated to 1.33.0-wmf.18 at 2019-02-21 18:50. Both loadEventEnd overall and responseStart mobile have been at significantly higher levels since that deployment.

Event Timeline

Gilles created this task. · Feb 28 2019, 11:24 AM
Restricted Application added a subscriber: Aklapper. · Feb 28 2019, 11:24 AM
Gilles triaged this task as Unbreak Now! priority. · Feb 28 2019, 11:25 AM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · Feb 28 2019, 11:25 AM
Gilles claimed this task. · Feb 28 2019, 11:26 AM
Gilles closed this task as Resolved. · Feb 28 2019, 12:09 PM

The explanation is that we lost most of Safari in navtiming during that deployment because of T217210: Nav timing throws exception on Safari "TypeError: entryTypes contained only unsupported types"

And since iOS devices tend to be used by people in wealthier countries, the missing data removed more fast traffic than slow traffic. This shows in the report rate for the US, which dropped more, relatively, than that of other countries:

Things should get back to normal when the backported bugfix goes live today with the 1.33.0-wmf.19 release.
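
To illustrate the sampling effect described above, here is a minimal sketch with invented numbers. The sample values, the function name `median_after_loss` and the choice of the median as the aggregate are all assumptions made for illustration; they are not taken from the navtiming pipeline or the dashboards. It only shows that dropping reports from a fast group raises the aggregate even though no individual page load got slower.

```python
from statistics import median


def median_after_loss(fast, slow, fast_kept):
    """Median of the pooled samples when only `fast_kept` fast samples still report.

    Hypothetical helper: `fast` and `slow` are lists of timings in milliseconds.
    """
    return median(fast[:fast_kept] + slow)


# Invented loadEventEnd samples in milliseconds (illustration only).
fast_samples = [800, 900, 1000, 1100, 1200, 1300]  # group whose reports mostly vanish
slow_samples = [2000, 2500, 3000, 3500]            # group that keeps reporting

# All ten samples are reported.
before = median_after_loss(fast_samples, slow_samples, fast_kept=6)

# Only two of the six fast samples are still reported.
after = median_after_loss(fast_samples, slow_samples, fast_kept=2)

print(f"median before loss: {before} ms")
print(f"median after loss:  {after} ms")
```

In this toy data the median rises from 1250 ms to 2250 ms purely because the mix of reporting clients changed, which is the same shape of effect as an apparent regression caused by missing Safari reports.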

Gilles added a comment. · Mar 1 2019, 6:19 AM

Fix confirmed: