
Unexplained increase in save times, possibly associated with DC switchover
Closed, ResolvedPublic

Description

Perf's Save Timing dashboard shows a multiple-second increase in save times, starting around the time we migrated active-active services to codfw-only (Aug 31, about 14:21 UTC), and not recovering fully when we moved MediaWiki to codfw today (Sep 1, about 14:04 UTC).

Nothing points to the switchover yet except for the timing; three seconds is much longer than we would expect from cross-dc latency.

Event Timeline

It seems the actions taken to resolve T261846 have solved this issue as well. Let's keep an eye on it, but that appears to be the case.

Save times remain back at roughly their previous levels. For the historical record: was something moved back to eqiad, or was it just the extra partitioning that fixed this (the other ticket is unclear on this)?

Everything started with this graph:

https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=41&orgId=1&from=1598875576991&to=1599117396721&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=$__all

That is basically the average latency of eventgate-main's producer to kafka-main codfw, broken down by broker. kafka2003 appeared to report high latency right after the switchover. kafka2003 was also the partition leader for two high-volume topics, resource_purge and resource_change (the partition leader is the one broker, out of the three replicas, responsible for receiving messages from producers).
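For reference, a minimal sketch of how the partition leaders for those topics could be inspected, assuming Python with the confluent-kafka admin client; the bootstrap broker address is a placeholder, not taken from this task:

```
# Sketch: list which broker currently leads each partition of the two
# high-volume topics. Assumes the confluent-kafka Python client; the
# bootstrap.servers value below is a placeholder.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka-main2001.codfw.wmnet:9092"})  # placeholder host

metadata = admin.list_topics(timeout=10)
for topic in ("resource_purge", "resource_change"):
    partitions = metadata.topics[topic].partitions
    for pid, pmeta in sorted(partitions.items()):
        # pmeta.leader is the broker id leading this partition; producers
        # send all writes for the partition to that one broker.
        print(f"{topic} partition {pid}: leader broker {pmeta.leader}")
```

With a single partition per topic, all producer traffic for that topic lands on one leader broker, which is why kafka2003 was absorbing the load for both topics.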

As a first step, Joe temporarily moved restbase-async to eqiad (repooling eventgate-main in eqiad too), which seemed to help both the eventgate and Save Timing latencies. Then we increased the number of partitions for resource_purge, and finally for resource_change. After the first partition-count change, Joe's change was rolled back. This spread the traffic volume across three kafka brokers instead of concentrating it on one.
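A sketch of the partition increase, again assuming the confluent-kafka admin client; the target partition count shown is illustrative, not necessarily the value actually used:

```
# Sketch: raise a topic's partition count so producer traffic spreads across
# more broker leaders. The total count below is illustrative only.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "kafka-main2001.codfw.wmnet:9092"})  # placeholder host

# Request a new *total* partition count (must be higher than the current one).
futures = admin.create_partitions([NewPartitions("resource_purge", 3)])
for topic, future in futures.items():
    future.result()  # raises if the request fails, e.g. count is not an increase
    print(f"{topic}: partition count increased")
```

Note that adding partitions only redistributes newly produced messages; existing messages stay on their original partitions, so the latency improvement shows up as producers start writing to the new leaders.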

As far as I can tell, this is the leading theory; I'm not sure whether there is a more precise or better one.

We have moved restbase-async to eqiad again, as the load was still too high. We might have to consider expanding the kafka clusters or adding a new one dedicated to purges and resource changes.

Krinkle closed this task as Resolved. Edited Sep 3 2020, 3:58 PM

Looks good to me now:

Grafana: Save Timing

Backend Save Timing: Screenshot 2020-09-03 at 16.57.27.png
Frontend Save Timing: Screenshot 2020-09-03 at 16.57.59.png