
eventgate-main latencies very high since the failover to codfw
Closed, Resolved · Public

Description

When we transitioned all services except MediaWiki to codfw, we saw the eventgate latencies for MediaWiki spike, which was kinda expected given we were going cross-DC:

These are the eventgate latencies seen from servers in eqiad.

We expected latencies to go down once we transitioned MediaWiki to codfw. Instead, not only did they not go down, but we also saw very high latency in restbase (which was transitioned to codfw together with eventgate).

This seems like a possible cause of T261763 and is a very serious performance regression.

Event Timeline

Joe created this task. Wed, Sep 2, 10:07 AM
Restricted Application added a subscriber: Aklapper. Wed, Sep 2, 10:07 AM
Joe triaged this task as Unbreak Now! priority. Wed, Sep 2, 10:08 AM

Setting priority to UBN! given the seriousness of the perf regression.

Joe added a comment. Wed, Sep 2, 10:14 AM

It looks like kafka2003 is the culprit: its broker latencies are on the order of 1 second.

Joe added a comment. Wed, Sep 2, 10:28 AM

So this is probably due to all the purges going through the codfw kafka2003 server, combined with the fact that we still haven't partitioned the purge topic.

In normal conditions, the purges are almost evenly distributed across datacenters, as restbase-async is codfw-only and sends its purges to the local eventgate, while mediawiki is in eqiad and sends its purges to eqiad's eventgate.
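The reasoning above can be sketched as a toy model. This is only an illustration: the relative volumes are assumed to be equal (the comment says "almost evenly distributed"), and `purge_load_by_dc` is a hypothetical helper, not part of any real tooling.

```python
from collections import Counter


def purge_load_by_dc(placement, volume):
    """Sum the purge volume each datacenter's eventgate receives.

    Each service sends its purges to the eventgate in the datacenter
    where the service itself runs.
    """
    load = Counter()
    for service, dc in placement.items():
        load[dc] += volume[service]
    return dict(load)


# Assumption: both sources emit roughly the same relative purge volume.
volume = {"mediawiki": 1.0, "restbase-async": 1.0}

# Normal conditions: mediawiki in eqiad, restbase-async in codfw.
normal = {"mediawiki": "eqiad", "restbase-async": "codfw"}

# After the switchover: both purge sources run in codfw.
failover = {"mediawiki": "codfw", "restbase-async": "codfw"}

print(purge_load_by_dc(normal, volume))
print(purge_load_by_dc(failover, volume))
```

Under these assumptions the purge traffic in codfw roughly doubles after the switchover, while eqiad receives none.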

So as a temporary mitigation while we find a good strategy, I propose to do the following:

  • move restbase-async to eqiad
  • repool eventgate-main in eqiad

Joe closed this task as Resolved. Wed, Sep 2, 3:36 PM
Joe claimed this task.

We added two additional partitions to resource_purge, and this seems to have solved the issue, mostly.

FYI, I also increased the partitions to 3 for resource_change.
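A minimal sketch of why adding partitions helps, under simplifying assumptions: each partition has a single leader broker that handles its produce requests, leaders are spread round-robin over the brokers, and messages are spread round-robin over the partitions. The broker names are placeholders and `produce_load_per_broker` is a hypothetical helper; the real leader assignment and producer partitioning may differ.

```python
from collections import Counter


def produce_load_per_broker(num_partitions, brokers, num_messages):
    """Count messages handled by each broker as partition leader."""
    # Assumption: partition p is led by broker p modulo the broker count.
    leaders = [brokers[p % len(brokers)] for p in range(num_partitions)]
    load = Counter({broker: 0 for broker in brokers})
    for i in range(num_messages):
        # Assumption: producers distribute messages round-robin.
        load[leaders[i % num_partitions]] += 1
    return dict(load)


brokers = ["broker-a", "broker-b", "broker-c"]  # placeholder names

# One partition: a single broker leads it and takes every message.
print(produce_load_per_broker(1, brokers, 3000))

# Three partitions: the same traffic is split across three leaders.
print(produce_load_per_broker(3, brokers, 3000))
```

With a single partition, one broker absorbs all purge traffic, which matches the symptom seen on kafka2003. With three partitions the load in this model is shared evenly.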

elukey added a comment. Thu, Sep 3, 7:13 AM

Reporting the SAL entries that we mistakenly logged to another task:

Mentioned in SAL (#wikimedia-operations) [2020-09-02T14:28:58Z] <elukey> execute kafka topics --alter --topic codfw.resource-purge --partitions 3 and kafka topics --alter --topic eqiad.resource-purge --partitions 3 on kafka-main codfw

Mentioned in SAL (#wikimedia-operations) [2020-09-02T14:31:44Z] <elukey> execute kafka topics --alter --topic codfw.resource-purge --partitions 3 and kafka topics --alter --topic eqiad.resource-purge --partitions 3 on kafka-main eqiad

Mentioned in SAL (#wikimedia-operations) [2020-09-02T18:32:49Z] <ottomata> execute kafka topics --alter --topic codfw.resource-purge --partitions 3 and kafka topics --alter --topic eqiad.resource-purge --partitions 3 on kafka jumbo-eqiad (for consistency with main)

Mentioned in SAL (#wikimedia-operations) [2020-09-02T18:34:30Z] <ottomata> execute kafka topics --alter --topic codfw.resource_change --partitions 3 and kafka topics --alter --topic eqiad.resource_change --partitions 3 on kafka main-eqiad

Mentioned in SAL (#wikimedia-operations) [2020-09-02T18:37:21Z] <ottomata> execute kafka topics --alter --topic codfw.resource_change --partitions 3 and kafka topics --alter --topic eqiad.resource_change --partitions 3 on kafka main-codfw

Mentioned in SAL (#wikimedia-operations) [2020-09-02T18:38:11Z] <ottomata> execute kafka topics --alter --topic codfw.resource_change --partitions 3 and kafka topics --alter --topic eqiad.resource_change --partitions 3 on kafka jumbo-eqiad (for consistency with main)
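The SAL entries follow one pattern: the same partition count is applied to both datacenter-prefixed topics on every cluster. A rough sketch of generating equivalent command lines is below. It targets the upstream `kafka-topics.sh` tool rather than the `kafka` wrapper used in the log, and the bootstrap address and the helper name `alter_partition_commands` are assumptions for illustration. Note that Kafka only allows the partition count of an existing topic to be increased, never decreased.

```python
def alter_partition_commands(base_topic, datacenters, partitions, bootstrap_server):
    """Build one kafka-topics.sh argument list per datacenter-prefixed topic."""
    commands = []
    for dc in datacenters:
        commands.append([
            "kafka-topics.sh",
            "--bootstrap-server", bootstrap_server,  # placeholder address
            "--alter",
            "--topic", f"{dc}.{base_topic}",
            "--partitions", str(partitions),
        ])
    return commands


# Placeholder bootstrap server; the commands are printed, not executed.
for cmd in alter_partition_commands(
    "resource_change", ["codfw", "eqiad"], 3, "localhost:9092"
):
    print(" ".join(cmd))
```

The same list would need to be run once per cluster, as the log entries above do for main-eqiad, main-codfw and jumbo-eqiad.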