
MW Job consumers sometimes pause for several minutes
Closed, Resolved · Public

Description

It has happened twice recently and affected the cirrusSearchLinksUpdatePrioritized queue.

The impact on Cirrus updates is visible as we nearly stop pushing data to elastic from CirrusSearch:

cirrus_updates.png (graph of the Cirrus update rate, 36 KB)

From the Kafka consumer lag, just when we seem to resume consuming (at 12:20), we had enqueued about 164k docs to this queue (a template update?):
kafka_lag.png (graph of consumer lag, 49 KB)

In this example we seem to have stopped consuming this queue for about 20 minutes (2019-05-27 from 12:00 to 12:20).
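
For reference, a minimal sketch (TypeScript/Node.js with node-rdkafka) of what the "consumer lag" in the graph measures: the difference between the broker's high watermark and the group's committed offset for a partition. The broker address, topic name and partition below are illustrative placeholders, not the production values; the group id is the change-prop one for this queue.

// Hypothetical lag check for one partition of the cirrusSearchLinksUpdatePrioritized queue.
// Broker, topic and partition are illustrative, not the production values.
import * as Kafka from 'node-rdkafka';

const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'localhost:9092',
  'group.id': 'change-prop-cirrusSearchLinksUpdatePrioritized',
  'enable.auto.commit': false,
}, {});

consumer.connect();
consumer.on('ready', () => {
  const topic = 'mediawiki.job.cirrusSearchLinksUpdatePrioritized'; // assumed topic name
  const partition = 0;
  // We never subscribe(), so this client does not join the real consumer group;
  // it only asks the coordinator for the group's committed offset on this partition.
  consumer.committed([{ topic, partition }], 5000, (err, committed) => {
    if (err) throw err;
    // High watermark = offset of the newest message the broker holds for the partition.
    consumer.queryWatermarkOffsets(topic, partition, 5000, (err2, watermarks) => {
      if (err2) throw err2;
      const lag = watermarks.highOffset - committed[0].offset;
      console.log(`partition ${partition}: lag ${lag}`);
      consumer.disconnect();
    });
  });
});

A lag growing by ~164k over 20 minutes with no consumption is exactly the pattern visible in the graph above.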

Event Timeline

debt triaged this task as High priority.Jul 23 2019, 5:32 PM
debt subscribed.

This is still causing flaps daily... moving it up in priority.

I've found a more recent event of this happening.

Change-prop stopped consuming cirrusSearchLinksUpdatePrioritized at 23:32 on 07-27, then resumed at 23:51.

Nothing interesting in Change-Prop logs during that time period.

Some related Kafka log entries:

cat /var/log/kafka/server.log | grep '07-27 23' | grep cirrusSearchLinksUpdatePrioritized

[2019-07-27 23:10:36,091] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32628 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:10:37,942] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32629 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:10:37,946] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32629 (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:36,295] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32629 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:38,056] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32630 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:30:38,071] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32630 (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:36,192] INFO [GroupCoordinator 1001]: Preparing to rebalance group change-prop-cirrusSearchLinksUpdatePrioritized with old generation 32630 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:38,058] INFO [GroupCoordinator 1001]: Stabilized group change-prop-cirrusSearchLinksUpdatePrioritized generation 32631 (__consumer_offsets-14) (kafka.coordinator.group.GroupCoordinator)
[2019-07-27 23:50:38,065] INFO [GroupCoordinator 1001]: Assignment received from leader for group change-prop-cirrusSearchLinksUpdatePrioritized for generation 32631 (kafka.coordinator.group.GroupCoordinator)

So it seems that at 23:30 there was a rebalance, which caused change-prop to stop processing events until the next rebalance at 23:50.
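
One way to confirm this from the client side (rather than only from the broker's GroupCoordinator log) would be to log change-prop's own rebalance callbacks. A rough node-rdkafka sketch, with broker address and topic name as placeholders:

// Hypothetical: log partition assignments/revocations with timestamps so client-side
// rebalances can be lined up with the GroupCoordinator entries above.
import * as Kafka from 'node-rdkafka';

const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'localhost:9092',   // placeholder broker
  'group.id': 'change-prop-cirrusSearchLinksUpdatePrioritized',
  'rebalance_cb': (err, assignment) => {
    const ts = new Date().toISOString();
    if (err.code === Kafka.CODES.ERRORS.ERR__ASSIGN_PARTITIONS) {
      console.log(`${ts} rebalance: assigned ${JSON.stringify(assignment)}`);
      consumer.assign(assignment);   // keep librdkafka's default behaviour
    } else if (err.code === Kafka.CODES.ERRORS.ERR__REVOKE_PARTITIONS) {
      console.log(`${ts} rebalance: revoked ${JSON.stringify(assignment)}`);
      consumer.unassign();
    }
  },
}, {});

consumer.connect();
consumer.on('ready', () => {
  consumer.subscribe(['mediawiki.job.cirrusSearchLinksUpdatePrioritized']); // assumed topic
  consumer.consume();
});
consumer.on('data', (message) => {
  // normal job processing would go here
});

If a "revoked" line around 23:30 is only followed by an "assigned" line around 23:50, the 20-minute gap sits on the client side of the rebalance rather than in the broker.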

This looks a lot like a bug in either librdkafka or node-rdkafka. Searching through issues in both regarding rebalance behavior, I've found that there are a couple of fixes in newer librdkafka versions that address behavior very similar to what we're experiencing.

We're running on librdkafka 0.11.3, which is quite old, so I propose we try getting onto a newer librdkafka version before digging much deeper into this.
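
For what it's worth, the binding exposes which librdkafka it was compiled against, so confirming the version before/after an upgrade could look something like this (a sketch, assuming the node-rdkafka exports I remember):

// Sketch: print the librdkafka version and feature set node-rdkafka was built with.
import * as Kafka from 'node-rdkafka';

console.log(`librdkafka: ${Kafka.librdkafkaVersion}`);   // e.g. "0.11.3"
console.log(`features: ${Kafka.features.join(', ')}`);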

Change 529368 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] cirrus: increase alerting threshold for Cirrus update rate check.

https://gerrit.wikimedia.org/r/529368

Change 529368 merged by Gehel:
[operations/puppet@production] cirrus: increase alerting threshold for Cirrus update rate check.

https://gerrit.wikimedia.org/r/529368

@Ottomata @Pchelolo As we all think about goals and work for the next quarter, I'd like to advocate for this one to be addressed. While everything does eventually get processed, we see pretty dramatic fluctuation in jobs processed on a daily basis because consumers stop for 10 or 20 minutes at a time. This is not incredibly pressing, but it seems like a big flaw that should be addressed.

This has been called out multiple times in scrum-of-scrums. It'd be great to get it accepted for a near-future Clinic Duty sprint.

@eprodromou This is dependent on the new k8s version of Changeprop moving to prod. That is in progress; I will try to get a new ETA, but it will not be immediate. Once that happens we can debug further.

Naike changed the task status from Open to Stalled.May 22 2020, 6:57 AM
Pchelolo claimed this task.

So, we've finally moved changeprop to k8s and updated everything to the latest versions: node, node-rdkafka, librdkafka - everything is now fresh like a garden tomato. Looking at the graphs for the last 7 days, I do not see any sudden stops in processing of any jobs - everything runs smoothly. I'm inclined to resolve this task, as it seems my initial suspicion that updating to later versions of the dependencies would make things more stable has been confirmed.

However, 1 week is not a huge amount of time, so please feel free to reopen if the problem reoccurs.