Investigate delay growth in trending service
Closed, ResolvedPublic

Description

I've put up a simple grafana dashboard for the trending service:
https://grafana-admin.wikimedia.org/dashboard/db/trending-service
and one graph there is a mystery:
https://grafana-admin.wikimedia.org/dashboard/db/trending-service?from=now-3h&to=now&panelId=3&fullscreen

This graph shows the average delay between event creation and event consumption, and the delay steadily grows over time. We need to investigate what's going on, the delay should be stable and it should in an order of seconds.

Pchelolo created this task.Dec 13 2016, 8:42 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 13 2016, 8:42 PM

Could edits not be replaying correctly? Pages in memory seems very low and pages in memory seems to go down despite no purges.

@Jdlrobson That's what is mysterious. When we've redeployed the service the relay seemed to work - there was a lot of trending pages right after the service restart. But it looks like that over time pages were removed from trending and new pages were not added. Something is fishy here, I've created https://gerrit.wikimedia.org/r/#/c/327045/1 to be able to enable sampled trace logging and investigate further.

Jdlrobson moved this task from Backlog to Next up on the Trending-Service board.Dec 14 2016, 12:52 AM

Change 327168 had a related patch set uploaded (by Mobrovac):
Update trending-edits to a0ccc3f

https://gerrit.wikimedia.org/r/327168

Change 327168 merged by Mobrovac:
Update trending-edits to a0ccc3f

https://gerrit.wikimedia.org/r/327168

So, I've studied this a bit, and the delay is real. It seems the trending service falls behind pretty quickly after starting up. It is not clear at this point why this is happening since the service doesn't have a problem in replaying the previous hour's events in a matter of seconds.

Change 327212 had a related patch set uploaded (by Mobrovac):
Use the message's original TS for commit calculation.

https://gerrit.wikimedia.org/r/327212

Change 327212 merged by Mobrovac:
Use the message's original TS for commit calculation

https://gerrit.wikimedia.org/r/327212

GWicke added a subscriber: GWicke.Dec 14 2016, 7:49 PM

Is anything stopping the service from consuming new edits after catching up?

That's very surprising, but I found the reason. The node-rdkafka driver has several consumption modes:

  • flowing mode when it should read messages as fast as possible
  • non-flowing mode when we instruct it to read messages one-by-one manually.

We've used the flowing mode in the service as it looked more suitable, but it seems there's some thread contention or something in the flowing mode, so the service indeed can't keep up with the rate of events. Incredible.

We have 2 options here:

  • Switch to non-flowing mode that's been successfully use by change-prop
  • Try to investigate the bug in the driver, fix it and stay with flowing mode.

The first one is easier, but the code overall would be less clean. So I propose to go with option 2. Will keep you updated

That's very surprising, but I found the reason. The node-rdkafka driver has several consumption modes:

  • flowing mode when it should read messages as fast as possible
  • non-flowing mode when we instruct it to read messages one-by-one manually.

That was my conclusion after looking at it today as well. Non-flowing mode works just fine in Change Prop, but the flowing mode's perpetual event loop seems to have a problem.

We have 2 options here:

  • Switch to non-flowing mode that's been successfully use by change-prop
  • Try to investigate the bug in the driver, fix it and stay with flowing mode.

    The first one is easier, but the code overall would be less clean. So I propose to go with option 2. Will keep you updated

I would like to see option 2 enacted in the long term too, but given the deployment freeze next week I would prefer to go with option (1) for now to minimise potential problems during the 4-week freeze period.

mobrovac triaged this task as High priority.Dec 14 2016, 10:33 PM

Change 327378 had a related patch set uploaded (by Ppchelko):
Temporarily switch to non-flowing mode.

https://gerrit.wikimedia.org/r/327378

Ok @mobrovac makes sense. I've uploaded a patch that switches it to non-flowing mode. In the meantime I will continue figuring out the bug in the driver.

Created an upstream issue: https://github.com/Blizzard/node-rdkafka/issues/75 I know a way how to fix it up upstream, but it's a bit hacky until we discuss and understand the core reason of this problem.

Change 327378 merged by Mobrovac:
Temporarily switch to non-flowing mode.

https://gerrit.wikimedia.org/r/327378

Mentioned in SAL (#wikimedia-operations) [2016-12-15T12:04:44Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@200a709]: Fix for T153122

Mentioned in SAL (#wikimedia-operations) [2016-12-15T12:05:08Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@200a709]: Fix for T153122 (duration: 00m 24s)

mobrovac changed the task status from Open to Stalled.
mobrovac lowered the priority of this task from High to Normal.
mobrovac edited projects, added User-mobrovac; removed Patch-For-Review.

The fix did the trick! Thank you @Pchelolo ! Let's follow up here once the upstream issue has been resolved.

I've fixed the upstream bug, lets wait for a new version to be published and update the trending service.

Change 327859 had a related patch set uploaded (by Ppchelko):
Kafka consumer: Switch back to flowing mode.

https://gerrit.wikimedia.org/r/327859

Change 327859 merged by Mobrovac:
Kafka consumer: Switch back to flowing mode.

https://gerrit.wikimedia.org/r/327859

Mentioned in SAL (#wikimedia-operations) [2017-01-18T00:53:58Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@1d53b7c]: fixes for T153122 and T145571

Mentioned in SAL (#wikimedia-operations) [2017-01-18T00:59:04Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@1d53b7c]: fixes for T153122 and T145571 (duration: 05m 06s)

mobrovac closed this task as Resolved.Jan 18 2017, 1:04 AM
mobrovac edited projects, added Services (done); removed Patch-For-Review, Services (doing).

Deployed, resolving.