
~100ms regression in page save timing
Closed, Resolved · Public

Description

(original description replaced with Timo's comment T121564#1882398)

100-200ms at https://grafana-admin.wikimedia.org/dashboard/db/save-timing compared to last week. Started to diverge at 2015-12-14 16:00 UTC.

Screen Shot 2015-12-15 at 12.39.07.png (attached screenshot)

Event Timeline

ori assigned this task to aaron.
ori raised the priority of this task to Unbreak Now!.
ori updated the task description.
ori added a project: Performance-Team.
ori subscribed.

ori set Security to None.

None of the SAL entries seem especially suspect:

16:27 thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% Part II gerrit:258985 (duration: 00m 29s)
16:14 thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% gerrit:258985 (duration: 00m 30s)
16:13 jynus: performing schema change on testwiki.page (s3)
16:05 thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: do not index User namespace on ko.wikipedia gerrit:258667 (duration: 00m 29s)
15:01 bblack: "service ganglia-monitor restart" on cp*, because most had no varnish stats flowing into ganglia

These two changes also went out around that time:

15:26: VCL: no special handling for CentralAutoLogin
https://gerrit.wikimedia.org/r/258207

15:50: cache_text/mobile: send randomized pass traffic directly to t1 backends
https://gerrit.wikimedia.org/r/258465
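
For context, a minimal VCL sketch of what "send randomized pass traffic directly to t1 backends" roughly means: on a tier-2 frontend, a random share of pass (uncacheable) requests is pointed straight at a tier-1 backend instead of chaining through the local tier-2 backend. The backend names, address, cookie check, and 50% split below are illustrative assumptions, not the code from gerrit change 258465:

```
import std;

# Hypothetical backends -- names and addresses are placeholders.
backend be_t2_local  { .host = "127.0.0.1";              .port = "3128"; }
backend be_t1_direct { .host = "t1-backend.example.net"; .port = "3128"; }

sub vcl_recv {
    if (req.http.Cookie ~ "([sS]ession|Token)=") {
        # Uncacheable (pass) traffic: route a random share of it directly
        # to a tier-1 backend instead of via the local tier-2 backend.
        if (std.random(0.0, 100.0) < 50.0) {
            set req.backend = be_t1_direct;
        } else {
            set req.backend = be_t2_local;
        }
        return (pass);
    }
}
```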

I am doing a rolling restart of MySQL servers, which may temporarily increase the latency of one server out of each shard while its buffers are not yet warmed up. There are also some spikes in activity ongoing this week, both on writes and on reads from jobs, sometimes with 3x the normal number of connections (T121623). Can you show where the main impact is in your profiling? I asked you not long ago and you mentioned that writing to the External Storage servers was not noticeable.

Change 259518 had a related patch set uploaded (by BBlack):
Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"

https://gerrit.wikimedia.org/r/259518

Change 259518 merged by BBlack:
Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"

https://gerrit.wikimedia.org/r/259518

Looks fixed as of 15:15 today (numbers right back to normal). This aligns with the CDN change (PST):

7:48 AM <grrrit-wm> (CR) BBlack: [C: 2 V: 2] Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" [puppet] - https://gerrit.wikimedia.org/r/259518 (https://phabricator.wikimedia.org/T121564) (owner: BBlack)

I can confirm that the issues I mentioned before are not related (I commented because of some IRC chat).

Change 259800 had a related patch set uploaded (by BBlack):
Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""

https://gerrit.wikimedia.org/r/259800

Change 259800 merged by BBlack:
Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""

https://gerrit.wikimedia.org/r/259800

So: several other changes went in to try to fix the underlying pass-consistency issue, and they seemed successful. I've reverted the revert to see whether we can get away with it now without hurting SaveTiming. One caveat: there may be a lingering (but decreasing) number of cache entries out there that the latest fixes don't cover; those will slowly fix themselves as they expire.

I didn't see any +100ms effect at the time of the merge (~21:40), but I now see such a pattern growing, with the deviation looking clear from ~22:40 onwards. It's too early to tell whether it will persist as a real pattern until we see a few more hours of the graph. If it obviously does and I'm not around, feel free to revert again.

Change 260048 had a related patch set uploaded (by BBlack):
varnish: appropriate t2-fe -> t1-be backend_options

https://gerrit.wikimedia.org/r/260048

Change 260048 merged by BBlack:
varnish: appropriate t2-fe -> t1-be backend_options

https://gerrit.wikimedia.org/r/260048
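
As a rough illustration of the kind of per-backend options the patch title refers to: when a tier-2 frontend talks directly to a tier-1 backend in another data centre, its connect timeouts, response timeouts, and connection limits need different tuning than for a co-located backend. The values and hostname below are assumptions for the sketch, not the settings from gerrit change 260048:

```
# Illustrative values only; the real settings live in the puppet varnish module.
backend be_t1_direct {
    .host                  = "t1-backend.example.net";  # hypothetical tier-1 host
    .port                  = "3128";
    .connect_timeout       = 3s;    # cross-DC connects take longer than localhost
    .first_byte_timeout    = 65s;   # allow slow app-server responses (e.g. page saves)
    .between_bytes_timeout = 4s;
    .max_connections       = 1000;  # cap direct connections per frontend
}
```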

The graph still looks like a persistent regression. I'll revert the key change today and mess with this more after the holidays.

Change 260204 had a related patch set uploaded (by BBlack):
Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"""

https://gerrit.wikimedia.org/r/260204

Change 260204 merged by BBlack:
Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"""

https://gerrit.wikimedia.org/r/260204