
Unexplained increase in save times, possibly associated with DC switchover
Closed, ResolvedPublic

Description

Perf's Save Timing dashboard shows a multiple-second increase in save times, starting around the time we migrated active-active services to codfw-only (Aug 31, about 14:21 UTC), and not recovering fully when we moved MediaWiki to codfw today (Sep 1, about 14:04 UTC).

Nothing points to the switchover yet except for the timing; three seconds is much longer than we would expect from cross-dc latency.

Event Timeline

It seems the actions taken to resolve T261846 have solved this issue as well. Let's keep an eye on it, but that appears to be the case.

Save times remain back at roughly their previous levels. For the historical record: was something moved back to eqiad, or was it just the extra partitioning that fixed this (the other ticket is unclear on this)?

Everything started with this graph:

https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=41&orgId=1&from=1598875576991&to=1599117396721&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=$__all

That is basically the average latency of eventgate-main's producer to kafka-main codfw, broken down by broker. kafka2003 appeared to report high latency right after the switchover. kafka2003 was also the partition leader for two high-volume topics, resource_purge and resource_change (the partition leader is the one broker, out of the three replicas, responsible for receiving messages from producers).
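For reference, a minimal sketch of how the partition leaders for those topics could be inspected, assuming Python with the confluent-kafka admin client; the bootstrap broker address is a placeholder, not taken from this task:

```
# Sketch: list which broker currently leads each partition of the two
# high-volume topics. Assumes the confluent-kafka Python client; the
# bootstrap.servers value below is a placeholder.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka-main2001.codfw.wmnet:9092"})  # placeholder host

metadata = admin.list_topics(timeout=10)
for topic in ("resource_purge", "resource_change"):
    partitions = metadata.topics[topic].partitions
    for pid, pmeta in sorted(partitions.items()):
        # pmeta.leader is the broker id leading this partition; producers
        # send all writes for the partition to that one broker.
        print(f"{topic} partition {pid}: leader broker {pmeta.leader}")
```

With a single partition per topic, all producer traffic for that topic lands on one leader broker, which is why kafka2003 was absorbing the load for both topics.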

As a first step, Joe temporarily moved restbase-async to eqiad (repooling eventgate-main in eqiad too), which seemed to help both the eventgate and Save Timing latencies. Then we increased the number of partitions for resource_purge, and finally for resource_change. After the first partition-count change, Joe's change was rolled back. This spread the traffic volume across three kafka brokers instead of concentrating it on one.
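A sketch of the partition increase, again assuming the confluent-kafka admin client; the target partition count shown is illustrative, not necessarily the value actually used:

```
# Sketch: raise a topic's partition count so producer traffic spreads across
# more broker leaders. The total count below is illustrative only.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "kafka-main2001.codfw.wmnet:9092"})  # placeholder host

# Request a new *total* partition count (must be higher than the current one).
futures = admin.create_partitions([NewPartitions("resource_purge", 3)])
for topic, future in futures.items():
    future.result()  # raises if the request fails, e.g. count is not an increase
    print(f"{topic}: partition count increased")
```

Note that adding partitions only redistributes newly produced messages; existing messages stay on their original partitions, so the latency improvement shows up as producers start writing to the new leaders.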

As far as I can tell, this is the leading theory; I'm not sure whether there is a more precise or better one.

We have moved restbase-async to eqiad again, as the load was still too high. We might have to consider expanding the kafka clusters or adding a new one dedicated to purges and resource changes.

Krinkle closed this task as Resolved. Edited Sep 3 2020, 3:58 PM

Looks good to me now:

Grafana: Save Timing

Backend Save Timing: Screenshot 2020-09-03 at 16.57.27.png
Frontend Save Timing: Screenshot 2020-09-03 at 16.57.59.png