
Investigate delay growth in trending service
Closed, Resolved · Public

Description

I've put up a simple grafana dashboard for the trending service:
https://grafana-admin.wikimedia.org/dashboard/db/trending-service
and one graph there is a mystery:
https://grafana-admin.wikimedia.org/dashboard/db/trending-service?from=now-3h&to=now&panelId=3&fullscreen

This graph shows the average delay between event creation and event consumption, and that delay steadily grows over time. We need to investigate what's going on; the delay should be stable and on the order of seconds.
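
For reference, the delay plotted there is the difference between an event's embedded creation time and the wall-clock time at which the consumer processes it. A minimal sketch of how such a gauge could be reported; the statsd client, metric name and the event's `meta.dt` field are assumptions, not the service's actual code:

```
// Sketch: report creation-to-consumption delay per consumed event.
// Metric name, prefix and the `meta.dt` field are illustrative assumptions.
const StatsD = require('hot-shots'); // any statsd-compatible client would do
const metrics = new StatsD({ prefix: 'trending-edits.' });

function reportDelay(kafkaMessage) {
    const event = JSON.parse(kafkaMessage.value.toString('utf8'));
    const createdAt = Date.parse(event.meta.dt); // assumed ISO-8601 creation time
    metrics.gauge('event_delay_ms', Date.now() - createdAt);
}
```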

Event Timeline

Could edits not be replaying correctly? The number of pages in memory seems very low and it keeps going down despite there being no purges.

@Jdlrobson That's what is mysterious. When we redeployed the service, the replay seemed to work: there were a lot of trending pages right after the service restart. But it looks like over time pages were removed from trending and new pages were not added. Something is fishy here. I've created https://gerrit.wikimedia.org/r/#/c/327045/1 to be able to enable sampled trace logging and investigate further.

Change 327168 had a related patch set uploaded (by Mobrovac):
Update trending-edits to a0ccc3f

https://gerrit.wikimedia.org/r/327168

Change 327168 merged by Mobrovac:
Update trending-edits to a0ccc3f

https://gerrit.wikimedia.org/r/327168

So, I've studied this a bit, and the delay is real. It seems the trending service falls behind pretty quickly after starting up. It is not clear at this point why this is happening, since the service has no problem replaying the previous hour's events in a matter of seconds.
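
For context, replaying the previous hour only requires seeking the consumer back by timestamp before it starts consuming. A rough sketch with node-rdkafka, assuming a single-partition topic; the broker list, group id and topic name are placeholders, not the deployed configuration:

```
const Kafka = require('node-rdkafka');

// Placeholder configuration; not the deployed service's settings.
const consumer = new Kafka.KafkaConsumer({
    'metadata.broker.list': 'localhost:9092',
    'group.id': 'trending-edits-replay'
}, {});

consumer.connect();
consumer.on('ready', () => {
    const oneHourAgo = Date.now() - 3600 * 1000;
    // For offsetsForTimes, the `offset` field carries the timestamp to resolve.
    consumer.offsetsForTimes(
        [{ topic: 'mediawiki.revision-create', partition: 0, offset: oneHourAgo }],
        10000,
        (err, offsets) => {
            if (err) { throw err; }
            consumer.assign(offsets); // start from the offsets matching the timestamp
            consumer.consume();       // then consume normally from there
        });
});

consumer.on('data', (msg) => {
    // process the replayed event
});
```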

Change 327212 had a related patch set uploaded (by Mobrovac):
Use the message's original TS for commit calculation.

https://gerrit.wikimedia.org/r/327212

Change 327212 merged by Mobrovac:
Use the message's original TS for commit calculation

https://gerrit.wikimedia.org/r/327212
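
Judging by the commit message alone, the change above bases the commit decision on the timestamp a message was originally produced with, rather than on the local time at which it was processed. A hypothetical illustration of that idea; the helper name and the catch-up window are made up for the example:

```
// Hypothetical illustration only: decide whether the consumer has caught up
// using the message's original timestamp instead of the local processing time.
const CATCH_UP_WINDOW_MS = 10 * 1000; // arbitrary example threshold

function isCaughtUp(kafkaMessage) {
    // node-rdkafka exposes the broker/producer timestamp in milliseconds when
    // one is attached; treat a missing timestamp as "not caught up".
    if (kafkaMessage.timestamp === undefined) {
        return false;
    }
    return Date.now() - kafkaMessage.timestamp < CATCH_UP_WINDOW_MS;
}

// e.g. only commit offsets once the backlog has been replayed:
// if (isCaughtUp(message)) { consumer.commitMessage(message); }
```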

Is anything stopping the service from consuming new edits after catching up?

That's very surprising, but I found the reason. The node-rdkafka driver has two consumption modes (sketched below):

  • flowing mode, in which the driver reads messages as fast as possible and pushes them out as events;
  • non-flowing mode, in which we instruct it to read messages one by one manually.

We've used flowing mode in the service as it looked more suitable, but it seems there's some thread contention or a similar issue in flowing mode, so the service indeed can't keep up with the rate of events. Incredible.

We have two options here:

  • Switch to non-flowing mode, which has been used successfully by change-prop.
  • Investigate the bug in the driver, fix it and stay with flowing mode.

The first one is easier, but the code overall would be less clean, so I propose to go with option 2. I will keep you updated.
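
For reference, the two modes differ only in how consumption is driven. A condensed node-rdkafka sketch; the broker list, group id and topic are placeholders:

```
const Kafka = require('node-rdkafka');

const consumer = new Kafka.KafkaConsumer({
    'metadata.broker.list': 'localhost:9092',
    'group.id': 'trending-edits'
}, {});

consumer.connect();
consumer.on('ready', () => {
    consumer.subscribe(['mediawiki.revision-create']);

    // Flowing mode: consume() with no arguments starts an internal fetch loop
    // and messages are pushed out as 'data' events as fast as they arrive.
    consumer.consume();

    // Non-flowing mode (what change-prop uses) would instead pull messages
    // explicitly and schedule the next pull itself, e.g.:
    //
    //   const pull = () => consumer.consume(1, (err, messages) => {
    //       if (!err && messages.length) { handle(messages[0]); }
    //       setImmediate(pull);
    //   });
    //   pull();
});

consumer.on('data', (msg) => {
    // Fires only in flowing mode.
    console.log(`offset ${msg.offset} on ${msg.topic}/${msg.partition}`);
});
```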

That was my conclusion after looking at it today as well. Non-flowing mode works just fine in Change Prop, but the flowing mode's perpetual event loop seems to have a problem.

I would like to see option 2 enacted in the long term too, but given the deployment freeze next week, I would prefer to go with option 1 for now to minimise potential problems during the 4-week freeze period.

Change 327378 had a related patch set uploaded (by Ppchelko):
Temporarily switch to non-flowing mode.

https://gerrit.wikimedia.org/r/327378

OK @mobrovac, that makes sense. I've uploaded a patch that switches the service to non-flowing mode. In the meantime I will continue figuring out the bug in the driver.

Created an upstream issue: https://github.com/Blizzard/node-rdkafka/issues/75. I know a way to fix it upstream, but it's a bit hacky until we discuss and understand the root cause of this problem.

Change 327378 merged by Mobrovac:
Temporarily switch to non-flowing mode.

https://gerrit.wikimedia.org/r/327378

Mentioned in SAL (#wikimedia-operations) [2016-12-15T12:04:44Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@200a709]: Fix for T153122

Mentioned in SAL (#wikimedia-operations) [2016-12-15T12:05:08Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@200a709]: Fix for T153122 (duration: 00m 24s)

mobrovac changed the task status from Open to Stalled.Dec 15 2016, 12:26 PM
mobrovac assigned this task to Pchelolo.
mobrovac lowered the priority of this task from High to Medium.
mobrovac edited projects, added User-mobrovac; removed Patch-For-Review.

The fix did the trick! Thank you @Pchelolo! Let's follow up here once the upstream issue has been resolved.

I've fixed the upstream bug; let's wait for a new version to be published and then update the trending service.

Change 327859 had a related patch set uploaded (by Ppchelko):
Kafka consumer: Switch back to flowing mode.

https://gerrit.wikimedia.org/r/327859

Change 327859 merged by Mobrovac:
Kafka consumer: Switch back to flowing mode.

https://gerrit.wikimedia.org/r/327859

Mentioned in SAL (#wikimedia-operations) [2017-01-18T00:53:58Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@1d53b7c]: fixes for T153122 and T145571

Mentioned in SAL (#wikimedia-operations) [2017-01-18T00:59:04Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@1d53b7c]: fixes for T153122 and T145571 (duration: 05m 06s)