
MW Job consumers sometimes pause for several minutes
Open, HighPublic

Description

It has happened twice recently, affecting the cirrusSearchLinksUpdatePrioritized queue.

The impact on Cirrus updates is visible: we nearly stop pushing data to Elasticsearch from CirrusSearch.


From the Kafka consumer lag graph, by the time we seem to resume consuming (12:20) we had enqueued 164k docs to this queue (a template update?):

In this example we seem to have stopped consuming this queue for about 20 minutes (2019-05-27 from 12:00 to 12:20).
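As a rough back-of-the-envelope check using the figures above (164k docs enqueued over the roughly 20-minute pause), the backlog builds up at about 8k docs per minute while consumers are stalled:

```python
# Rough backlog-growth estimate while consumers are paused, using the
# figures reported above: 164k docs enqueued over a ~20-minute pause.
enqueued_docs = 164_000
pause_minutes = 20

docs_per_minute = enqueued_docs / pause_minutes
print(f"~{docs_per_minute:,.0f} docs/minute of backlog")  # ~8,200 docs/minute
```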

Details

Related Gerrit Patches:

Event Timeline

dcausse created this task.May 27 2019, 12:36 PM
Restricted Application added a subscriber: Aklapper.May 27 2019, 12:36 PM
dcausse updated the task description.May 27 2019, 12:36 PM
dcausse edited projects, added WMF-JobQueue; removed MediaWiki-JobQueue.
dcausse moved this task from In Progress to Waiting on the Discovery-Search (Current work) board.
dcausse updated the task description.May 27 2019, 12:51 PM
dcausse updated the task description.May 27 2019, 12:54 PM
debt triaged this task as High priority.Jul 23 2019, 5:32 PM
debt added a subscriber: debt.

This is still causing flaps daily... moving it up in priority.

I've found a more recent event of this happening.

Change-Prop stopped consuming cirrusSearchLinksUpdatePrioritized at 23:32 on 07-27, then resumed at 23:51.

Nothing interesting in Change-Prop logs during that time period.

A bit of related Kafka logs:

grep '07-27 23' /var/log/kafka/server.log | grep cirrusSearchLinksUpdatePrioritized

[2019-07-27 23:10:36,091] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32628 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:10:37,942] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32629 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:10:37,946] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32629 (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:36,295] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32629 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:38,056] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32630 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:38,071] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32630 (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:36,192] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32630 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:38,058] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32631 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:38,065] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32631 (kafka.coordinator.group.GroupCoordinator)

So it seems that at 23:30 there was a rebalance, which caused Change-Prop to stop processing events until the next rebalance at 23:50.
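The 20-minute cadence between rebalances can be confirmed mechanically from the broker log timestamps; a minimal sketch (assuming the log format shown above, with a few lines inlined for illustration):

```python
import re
from datetime import datetime

# A subset of the broker log lines quoted above.
LOG = """\
[2019-07-27 23:10:36,091] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32628
[2019-07-27 23:30:36,295] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32629
[2019-07-27 23:50:36,192] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32630
"""

# Extract the timestamp of every "Preparing to rebalance" line.
stamps = [
    datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
    for m in re.finditer(r"\[(.*?)\] INFO .*Preparing to rebalance", LOG)
]

# Interval between consecutive rebalances, in minutes.
gaps = [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]
print(gaps)  # both intervals are almost exactly 20 minutes
```

The near-exact 20-minute spacing suggests the rebalances are driven by a timer (e.g. a periodic reconnect or metadata refresh) rather than by actual consumer churn.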

This looks a lot like a bug in either librdkafka or node-rdkafka. Searching through the issues of both regarding rebalance behavior, I've found that there are a couple of fixes in newer librdkafka versions addressing symptoms very similar to what we're experiencing.

We're running on librdkafka 0.11.3, which is quite old, so I propose we try getting ourselves onto a newer librdkafka version before digging much deeper into this.

Gehel added a subscriber: Gehel.Aug 8 2019, 7:38 AM

Change 529368 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] cirrus: increase alerting threshold for Cirrus update rate check.

https://gerrit.wikimedia.org/r/529368

Change 529368 merged by Gehel:
[operations/puppet@production] cirrus: increase alerting threshold for Cirrus update rate check.

https://gerrit.wikimedia.org/r/529368

@Ottomata @Pchelolo As we all think about goals and work for the next quarter, I'd like to advocate for addressing this one. While everything does eventually get processed, we see pretty dramatic daily fluctuation in jobs processed because consumers stop for 10 or 20 minutes at a time. This is not incredibly pressing, but it seems like a big flaw that should be addressed.

This has been called out multiple times in scrum-of-scrums. It'd be great to get it accepted for a near-future Clinic Duty sprint.

@eprodromou This is dependent on the new Kubernetes version of Change-Prop moving to production. That migration is in progress; I will try to get a new ETA, but it will not be immediate. Once that happens we can debug further.