(original description replaced with Timo's comment T121564#1882398)
100-200ms at https://grafana-admin.wikimedia.org/dashboard/db/save-timing compared to last week. Started to diverge at 2015-12-14 16:00 UTC.
(Attached screenshots: F3110001 Screen Shot 2015-12-16 at 11.01.18.png, F3105923 Screen Shot 2015-12-15 at 12.39.07.png, F3105740 download.png)
None of the SAL entries seem especially suspect:
16:27 thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% Part II gerrit:258985 (duration: 00m 29s)
16:14 thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% gerrit:258985 (duration: 00m 30s)
16:13 jynus: performing schema change on testwiki.page (s3)
16:05 thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: do not index User namespace on ko.wikipedia gerrit:258667 (duration: 00m 29s)
15:01 bblack: "service ganglia-monitor restart" on cp*, because most had no varnish stats flowing into ganglia
These two changes also went out around that time:
15:26: VCL: no special handling for CentralAutoLogin
https://gerrit.wikimedia.org/r/258207
15:50: cache_text/mobile: send randomized pass traffic directly to t1 backends
https://gerrit.wikimedia.org/r/258465
I am doing a rolling restart of the MySQL servers, which may temporarily increase the latency of one server in each shard while its buffers are not yet warmed up. There are also some ongoing spikes in activity this week, both on writes and on reads from jobs, sometimes with 3x the normal number of connections (T121623). Can you show where the main impact is in your profiling? I asked you not long ago and you mentioned that writing to the External Storage servers was not noticeable.
Change 259518 had a related patch set uploaded (by BBlack):
Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"
Change 259518 merged by BBlack:
Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"
Looks fixed as of 15:15 today (numbers are right back to normal). This aligns with the CDN change (times in PST):
7:48 AM <grrrit-wm> (CR) BBlack: [C: 2 V: 2] Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" [puppet] - https://gerrit.wikimedia.org/r/259518 (https://phabricator.wikimedia.org/T121564) (owner: BBlack)
I can confirm that the issues I mentioned before are not related (I commented because of some IRC chat).
Change 259800 had a related patch set uploaded (by BBlack):
Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""
Change 259800 merged by BBlack:
Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""
So: several other changes went in to try to fix the underlying issue with pass-consistency, and they seemed successful. I've reverted the revert to see if we can get away with it now without hurting SaveTiming. One caveat: there could be a lingering (but decreasing) number of cache entries out there that are not fixed by the latest fixes, though they will slowly fix themselves as they expire.
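The pass-consistency issue behind this change can be illustrated with a minimal sketch. The backend names and URLs below are hypothetical, and this is Python pseudocode for the routing concept, not the actual VCL: hash-based routing keeps all requests for a given URL on one tier-1 backend, while randomized pass routing can land each request on a different one.

```python
# Hypothetical sketch of the two routing strategies for "pass"
# (uncacheable) traffic. Backend names are made up for illustration.
import hashlib
import random

T1_BACKENDS = ["cp-be-01", "cp-be-02", "cp-be-03", "cp-be-04"]

def consistent_backend(url: str) -> str:
    """Hash-based routing: the same URL always maps to the same backend,
    so any per-URL state on that backend stays consistent."""
    h = int(hashlib.sha1(url.encode()).hexdigest(), 16)
    return T1_BACKENDS[h % len(T1_BACKENDS)]

def randomized_backend(url: str) -> str:
    """Randomized routing (the change under test): each pass request may
    land on any tier-1 backend, losing per-URL consistency."""
    return random.choice(T1_BACKENDS)

# Hash-based routing is deterministic per URL:
url = "/w/index.php?action=submit"
assert consistent_backend(url) == consistent_backend(url)
```

Under randomized routing, successive requests for the same URL are spread across all backends, which is the loss of consistency the fixes above were addressing.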
I didn't see any +100ms effect at the time of the merge (~21:40), but I now see such a pattern growing, with the deviation seeming clear from ~22:40 onwards. It's too early to tell whether it will persist as a real pattern until we see more hours of the graph. If it obviously does and I'm not around, feel free to revert again.
Change 260048 had a related patch set uploaded (by BBlack):
varnish: appropriate t2-fe -> t1-be backend_options
The graph still looks like a persistent regression. I will revert the key change today and mess with this more after the holidays.
Change 260204 had a related patch set uploaded (by BBlack):
Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"""
Change 260204 merged by BBlack:
Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"""