
MW Job consumers sometimes pause for several minutes
Closed, Resolved · Public

Description

It has happened twice recently and affected the cirrusSearchLinksUpdatePrioritized queue.

The impact on Cirrus updates is visible as we nearly stop pushing data to elastic from CirrusSearch:

cirrus_updates.png (graph of the Cirrus update rate, 36 KB)

From the Kafka consumer lag, just when we seem to resume consuming (at 12:20), we had enqueued about 164k docs to this queue (a template update?):
kafka_lag.png (graph of consumer lag, 49 KB)

In this example we seem to have stopped consuming this queue for about 20 minutes (2019-05-27 from 12:00 to 12:20).
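
For reference, a minimal sketch (TypeScript/Node.js with node-rdkafka) of what the "consumer lag" in the graph measures: the difference between the broker's high watermark and the group's committed offset for a partition. The broker address, topic name and partition below are illustrative placeholders, not the production values; the group id is the change-prop one for this queue.

// Hypothetical lag check for one partition of the cirrusSearchLinksUpdatePrioritized queue.
// Broker, topic and partition are illustrative, not the production values.
import * as Kafka from 'node-rdkafka';

const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'localhost:9092',
  'group.id': 'change-prop-cirrusSearchLinksUpdatePrioritized',
  'enable.auto.commit': false,
}, {});

consumer.connect();
consumer.on('ready', () => {
  const topic = 'mediawiki.job.cirrusSearchLinksUpdatePrioritized'; // assumed topic name
  const partition = 0;
  // We never subscribe(), so this client does not join the real consumer group;
  // it only asks the coordinator for the group's committed offset on this partition.
  consumer.committed([{ topic, partition }], 5000, (err, committed) => {
    if (err) throw err;
    // High watermark = offset of the newest message the broker holds for the partition.
    consumer.queryWatermarkOffsets(topic, partition, 5000, (err2, watermarks) => {
      if (err2) throw err2;
      const lag = watermarks.highOffset - committed[0].offset;
      console.log(`partition ${partition}: lag ${lag}`);
      consumer.disconnect();
    });
  });
});

A lag growing by ~164k over 20 minutes with no consumption is exactly the pattern visible in the graph above.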

Event Timeline

debt triaged this task as High priority.Jul 23 2019, 5:32 PM
debt subscribed.

This is still causing flaps daily... moving it up in priority.

I've found a more recent event of this happening.

Change-prop stopped consuming cirrusSearchLinksUpdatePrioritized at 23:32 on 07-27, then resumed at 23:51.

Nothing interesting in Change-Prop logs during that time period.

Some related Kafka log entries:

cat /var/log/kafka/server.log | grep '07-27 23' | grep cirrusSearchLinksUpdatePrioritized

[2019-07-27 23:10:36,091] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32628 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:10:37,942] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32629 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:10:37,946] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32629 (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:36,295] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32629 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:38,056] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32630 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:38,071] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32630 (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:36,192] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32630 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:38,058] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32631 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:38,065] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32631 (kafka.coordinator.group.GroupCoordinator)

So it seems that at 23:30 there was a rebalance, which caused change-prop to stop processing events until the next rebalance at 23:50.
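
One way to confirm this from the client side (rather than only from the broker's GroupCoordinator log) would be to log change-prop's own rebalance callbacks. A rough node-rdkafka sketch, with broker address and topic name as placeholders:

// Hypothetical: log partition assignments/revocations with timestamps so client-side
// rebalances can be lined up with the GroupCoordinator entries above.
import * as Kafka from 'node-rdkafka';

const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'localhost:9092',   // placeholder broker
  'group.id': 'change-prop-cirrusSearchLinksUpdatePrioritized',
  'rebalance_cb': (err, assignment) => {
    const ts = new Date().toISOString();
    if (err.code === Kafka.CODES.ERRORS.ERR__ASSIGN_PARTITIONS) {
      console.log(`${ts} rebalance: assigned ${JSON.stringify(assignment)}`);
      consumer.assign(assignment);   // keep librdkafka's default behaviour
    } else if (err.code === Kafka.CODES.ERRORS.ERR__REVOKE_PARTITIONS) {
      console.log(`${ts} rebalance: revoked ${JSON.stringify(assignment)}`);
      consumer.unassign();
    }
  },
}, {});

consumer.connect();
consumer.on('ready', () => {
  consumer.subscribe(['mediawiki.job.cirrusSearchLinksUpdatePrioritized']); // assumed topic
  consumer.consume();
});
consumer.on('data', (message) => {
  // normal job processing would go here
});

If a "revoked" line around 23:30 is only followed by an "assigned" line around 23:50, the 20-minute gap sits on the client side of the rebalance rather than in the broker.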

This looks a lot like a bug in either librdkafka or node-rdkafka. Searching through issues in both regarding rebalance behavior, I've found that there are a couple of fixes in newer librdkafka versions that address behavior very similar to what we're experiencing.

We're running on librdkafka 0.11.3, which is quite old, so I propose we try getting onto a newer librdkafka version before digging much deeper into this.
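
For what it's worth, the binding exposes which librdkafka it was compiled against, so confirming the version before/after an upgrade could look something like this (a sketch, assuming the node-rdkafka exports I remember):

// Sketch: print the librdkafka version and feature set node-rdkafka was built with.
import * as Kafka from 'node-rdkafka';

console.log(`librdkafka: ${Kafka.librdkafkaVersion}`);   // e.g. "0.11.3"
console.log(`features: ${Kafka.features.join(', ')}`);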

Change 529368 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] cirrus: increase alerting threshold for Cirrus update rate check.

https://gerrit.wikimedia.org/r/529368

Change 529368 merged by Gehel:
[operations/puppet@production] cirrus: increase alerting threshold for Cirrus update rate check.

https://gerrit.wikimedia.org/r/529368

@Ottomata @Pchelolo As we all think about goals and work for the next quarter, I'd like to advocate for this one to be addressed. While everything does eventually get processed, we see pretty dramatic fluctuation in jobs processed on a daily basis because consumers stop for 10 or 20 minutes at a time. This is not incredibly pressing, but it seems like a big flaw that should be addressed.

This has been called out multiple times in scrum-of-scrums. It'd be great to get it accepted for a near-future Clinic Duty sprint.

@eprodromou This is dependent on the new k8s version of Changeprop moving to prod. That is in progress; I will try to get a new ETA, but it will not be immediate. Once that happens we can debug further.

Naike changed the task status from Open to Stalled.May 22 2020, 6:57 AM
Pchelolo claimed this task.

So, we've finally moved changeprop to k8s and updated everything to the latest versions: node, node-rdkafka, librdkafka - everything is now fresh like a garden tomato. Looking at the graphs for the last 7 days, I do not see any sudden stops in processing of any jobs - everything runs smoothly. I'm inclined to resolve this task, as it seems my initial suspicion that updating to later versions of the dependencies would make things more stable has been confirmed.

However, 1 week is not a huge amount of time, so please feel free to reopen if the problem reoccurs.