
~100ms regression in page save timing
Closed, ResolvedPublic

Description

(original description replaced with Timo's comment T121564#1882398)

Save timing is up by 100-200ms at https://grafana-admin.wikimedia.org/dashboard/db/save-timing compared to last week. It started to diverge at 2015-12-14 16:00 UTC.
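
For reference, the week-over-week comparison behind that dashboard can be roughly reproduced against the Graphite render API. A minimal sketch; the metric name below is a guess and not necessarily the series the save-timing dashboard actually uses:

```python
import requests  # third-party HTTP client

# Sketch only: assumes a Graphite metric name for save timing; the real
# series behind the dashboard may differ.
GRAPHITE = "https://graphite.wikimedia.org/render"
METRIC = "MediaWiki.timing.editResponseTime.p75"  # assumed metric name

def median_ms(from_spec, until_spec):
    """Fetch the metric over a window and return the median datapoint (ms)."""
    resp = requests.get(GRAPHITE, params={
        "target": METRIC,
        "from": from_spec,
        "until": until_spec,
        "format": "json",
    })
    resp.raise_for_status()
    points = [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]
    points.sort()
    return points[len(points) // 2]

# 24h after the divergence point vs. the same window one week earlier.
this_week = median_ms("16:00_20151214", "16:00_20151215")
last_week = median_ms("16:00_20151207", "16:00_20151208")
print("save timing delta: %+.0f ms" % (this_week - last_week))
```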

Event Timeline

ori created this task. Dec 15 2015, 7:25 PM
ori assigned this task to aaron.
ori raised the priority of this task to Unbreak Now!.
ori updated the task description.
ori added a project: Performance-Team.
ori added a subscriber: ori.
Restricted Application added a subscriber: Aklapper. Dec 15 2015, 7:25 PM

Save timing is up by 100-200ms at https://grafana-admin.wikimedia.org/dashboard/db/save-timing compared to last week. It started to diverge at 2015-12-14 16:00 UTC.

ori updated the task description. Dec 16 2015, 1:45 AM
ori set Security to None.
ori added a comment. Dec 16 2015, 2:01 AM

None of the SAL entries seem especially suspect:

16:27 thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% Part II gerrit:258985 (duration: 00m 29s)
16:14 thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% gerrit:258985 (duration: 00m 30s)
16:13 jynus: performing schema change on testwiki.page (s3)
16:05 thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: do not index User namespace on ko.wikipedia gerrit:258667 (duration: 00m 29s)
15:01 bblack: "service ganglia-monitor restart" on cp*, because most had no varnish stats flowing into ganglia

These two changes also went out around that time:

15:26: VCL: no special handling for CentralAutoLogin
https://gerrit.wikimedia.org/r/258207

15:50: cache_text/mobile: send randomized pass traffic directly to t1 backends
https://gerrit.wikimedia.org/r/258465
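
For context on the second change, here is a rough toy model (in Python rather than VCL; the hostnames and selection policy are invented for illustration) of what "send randomized pass traffic directly to t1 backends" amounts to: pass traffic skips the normal tier-2 chain and goes straight to a randomly chosen tier-1 backend.

```python
import random

# Toy model only: not the actual VCL. Hostnames are hypothetical.
T1_BACKENDS = ["cp-t1-a", "cp-t1-b", "cp-t1-c"]   # hypothetical tier-1 caches

def next_hop(request_is_pass, local_t2_backend):
    """Pick the next hop for a request seen at a tier-2 cache.

    Cacheable traffic keeps following the normal tier-2 -> tier-1 chain;
    pass traffic (e.g. POST page saves) is sent straight to a randomly
    chosen tier-1 backend.
    """
    if request_is_pass:
        return random.choice(T1_BACKENDS)
    return local_t2_backend

# Example: a page save (pass) skips the local chain.
print(next_hop(request_is_pass=True, local_t2_backend="cp-t2-local"))
```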

jcrespo added a subscriber: jcrespo. Edited · Dec 16 2015, 9:38 AM

I am doing a rolling restart of the MySQL servers, which may temporarily increase the latency of one server in each shard while its buffers are not yet warmed up. There are also some spikes in activity ongoing this week, both on writes and on reads from jobs, sometimes with 3x the normal number of connections (T121623). Can you show where the main impact is in your profiling? I asked you not long ago and you mentioned that writing to the External Storage servers was not noticeable.

mark added a subscriber: mark. Dec 16 2015, 12:24 PM

Change 259518 had a related patch set uploaded (by BBlack):
Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"

https://gerrit.wikimedia.org/r/259518

Change 259518 merged by BBlack:
Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"

https://gerrit.wikimedia.org/r/259518

aaron added a comment. Dec 16 2015, 6:21 PM

Looks fixed as of 15:15 today (numbers are right back to normal). This aligns with the CDN change (IRC timestamp below is PST):

7:48 AM <grrrit-wm> (CR) BBlack: [C: 2 V: 2] Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" [puppet] - https://gerrit.wikimedia.org/r/259518 (https://phabricator.wikimedia.org/T121564) (owner: BBlack)

I can confirm that the issues I mentioned before are not related (I commented because of some IRC chat).

Change 259800 had a related patch set uploaded (by BBlack):
Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""

https://gerrit.wikimedia.org/r/259800

Change 259800 merged by BBlack:
Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""

https://gerrit.wikimedia.org/r/259800

BBlack added a subscriber: BBlack. Dec 17 2015, 11:35 PM

So - several other changes went in to try to fix the underlying issue with pass-consistency, and they seemed successful. I've reverted the revert to see if we can get away with it now without hurting SaveTiming. One caveat: there could be a lingering (but decreasing) number of cache entries out there which are not addressed by the latest fixes, but they will slowly go away as they expire.
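
To illustrate that caveat, a rough sketch (not actual Varnish code): an entry created under the old behaviour keeps affecting requests until its TTL runs out, regardless of what the current VCL does, so the effect fades only as those entries expire.

```python
import time

# Rough sketch, not actual Varnish internals: markers created under the old
# (broken) behaviour keep forcing pass until their TTL runs out, no matter
# what the current VCL does. As markers expire, the lingering effect fades.
class StaleMarkers:
    def __init__(self):
        self._expiry = {}                  # url -> absolute expiry time

    def record(self, url, ttl):
        """Marker created while the old behaviour was live."""
        self._expiry[url] = time.time() + ttl

    def still_forcing_pass(self, url):
        expiry = self._expiry.get(url)
        if expiry is None or time.time() >= expiry:
            self._expiry.pop(url, None)    # expired: behaviour is fixed now
            return False
        return True                        # lingering entry still in effect
```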

I didn't see any +100ms effect at the time of the merge (~21:40), but I now see such a pattern growing, with the deviation seeming clear from ~22:40-ish onwards. It's too early to tell whether it will persist as a real pattern until we see more hours of the graph. If it obviously does and I'm not around, feel free to revert again.

Change 260048 had a related patch set uploaded (by BBlack):
varnish: appropriate t2-fe -> t1-be backend_options

https://gerrit.wikimedia.org/r/260048

Change 260048 merged by BBlack:
varnish: appropriate t2-fe -> t1-be backend_options

https://gerrit.wikimedia.org/r/260048

The graph still looks like a persistent regression. I will revert the key change today and mess with this more after the holidays.

Change 260204 had a related patch set uploaded (by BBlack):
Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"""

https://gerrit.wikimedia.org/r/260204

Change 260204 merged by BBlack:
Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"""

https://gerrit.wikimedia.org/r/260204

Krinkle moved this task from Inbox to Doing on the Performance-Team board. Dec 21 2015, 7:44 PM
ori closed this task as Resolved. Dec 22 2015, 3:25 AM