
Investigate delay growth in trending service
Closed, Resolved · Public

Description

I've put up a simple grafana dashboard for the trending service:
https://grafana-admin.wikimedia.org/dashboard/db/trending-service
and one graph there is a mystery:
https://grafana-admin.wikimedia.org/dashboard/db/trending-service?from=now-3h&to=now&panelId=3&fullscreen

This graph shows the average delay between event creation and event consumption, and that delay steadily grows over time. We need to investigate what's going on; the delay should be stable and on the order of seconds.
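
For reference, the delay plotted there is the difference between an event's embedded creation time and the wall-clock time at which the consumer processes it. A minimal sketch of how such a gauge could be reported; the statsd client, metric name and the event's `meta.dt` field are assumptions, not the service's actual code:

```
// Sketch: report creation-to-consumption delay per consumed event.
// Metric name, prefix and the `meta.dt` field are illustrative assumptions.
const StatsD = require('hot-shots'); // any statsd-compatible client would do
const metrics = new StatsD({ prefix: 'trending-edits.' });

function reportDelay(kafkaMessage) {
    const event = JSON.parse(kafkaMessage.value.toString('utf8'));
    const createdAt = Date.parse(event.meta.dt); // assumed ISO-8601 creation time
    metrics.gauge('event_delay_ms', Date.now() - createdAt);
}
```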

Event Timeline

Could edits not be replaying correctly? The number of pages in memory seems very low and it keeps going down despite there being no purges.

@Jdlrobson That's what is mysterious. When we redeployed the service, the replay seemed to work: there were a lot of trending pages right after the service restart. But it looks like over time pages were removed from trending and new pages were not added. Something is fishy here. I've created https://gerrit.wikimedia.org/r/#/c/327045/1 to be able to enable sampled trace logging and investigate further.

Change 327168 had a related patch set uploaded (by Mobrovac):
Update trending-edits to a0ccc3f

https://gerrit.wikimedia.org/r/327168

Change 327168 merged by Mobrovac:
Update trending-edits to a0ccc3f

https://gerrit.wikimedia.org/r/327168

So, I've studied this a bit, and the delay is real. It seems the trending service falls behind pretty quickly after starting up. It is not clear at this point why this is happening, since the service has no problem replaying the previous hour's events in a matter of seconds.
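
For context, replaying the previous hour only requires seeking the consumer back by timestamp before it starts consuming. A rough sketch with node-rdkafka, assuming a single-partition topic; the broker list, group id and topic name are placeholders, not the deployed configuration:

```
const Kafka = require('node-rdkafka');

// Placeholder configuration; not the deployed service's settings.
const consumer = new Kafka.KafkaConsumer({
    'metadata.broker.list': 'localhost:9092',
    'group.id': 'trending-edits-replay'
}, {});

consumer.connect();
consumer.on('ready', () => {
    const oneHourAgo = Date.now() - 3600 * 1000;
    // For offsetsForTimes, the `offset` field carries the timestamp to resolve.
    consumer.offsetsForTimes(
        [{ topic: 'mediawiki.revision-create', partition: 0, offset: oneHourAgo }],
        10000,
        (err, offsets) => {
            if (err) { throw err; }
            consumer.assign(offsets); // start from the offsets matching the timestamp
            consumer.consume();       // then consume normally from there
        });
});

consumer.on('data', (msg) => {
    // process the replayed event
});
```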

Change 327212 had a related patch set uploaded (by Mobrovac):
Use the message's original TS for commit calculation.

https://gerrit.wikimedia.org/r/327212

Change 327212 merged by Mobrovac:
Use the message's original TS for commit calculation

https://gerrit.wikimedia.org/r/327212
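
Judging by the commit message alone, the change above bases the commit decision on the timestamp a message was originally produced with, rather than on the local time at which it was processed. A hypothetical illustration of that idea; the helper name and the catch-up window are made up for the example:

```
// Hypothetical illustration only: decide whether the consumer has caught up
// using the message's original timestamp instead of the local processing time.
const CATCH_UP_WINDOW_MS = 10 * 1000; // arbitrary example threshold

function isCaughtUp(kafkaMessage) {
    // node-rdkafka exposes the broker/producer timestamp in milliseconds when
    // one is attached; treat a missing timestamp as "not caught up".
    if (kafkaMessage.timestamp === undefined) {
        return false;
    }
    return Date.now() - kafkaMessage.timestamp < CATCH_UP_WINDOW_MS;
}

// e.g. only commit offsets once the backlog has been replayed:
// if (isCaughtUp(message)) { consumer.commitMessage(message); }
```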

Is anything stopping the service from consuming new edits after catching up?

That's very surprising, but I found the reason. The node-rdkafka driver has two consumption modes (sketched below):

  • flowing mode, in which the driver reads messages as fast as possible and pushes them out as events;
  • non-flowing mode, in which we instruct it to read messages one by one manually.

We've used flowing mode in the service as it looked more suitable, but it seems there's some thread contention or a similar issue in flowing mode, so the service indeed can't keep up with the rate of events. Incredible.

We have two options here:

  • Switch to non-flowing mode, which has been used successfully by change-prop.
  • Investigate the bug in the driver, fix it and stay with flowing mode.

The first one is easier, but the code overall would be less clean, so I propose to go with option 2. I will keep you updated.
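
For reference, the two modes differ only in how consumption is driven. A condensed node-rdkafka sketch; the broker list, group id and topic are placeholders:

```
const Kafka = require('node-rdkafka');

const consumer = new Kafka.KafkaConsumer({
    'metadata.broker.list': 'localhost:9092',
    'group.id': 'trending-edits'
}, {});

consumer.connect();
consumer.on('ready', () => {
    consumer.subscribe(['mediawiki.revision-create']);

    // Flowing mode: consume() with no arguments starts an internal fetch loop
    // and messages are pushed out as 'data' events as fast as they arrive.
    consumer.consume();

    // Non-flowing mode (what change-prop uses) would instead pull messages
    // explicitly and schedule the next pull itself, e.g.:
    //
    //   const pull = () => consumer.consume(1, (err, messages) => {
    //       if (!err && messages.length) { handle(messages[0]); }
    //       setImmediate(pull);
    //   });
    //   pull();
});

consumer.on('data', (msg) => {
    // Fires only in flowing mode.
    console.log(`offset ${msg.offset} on ${msg.topic}/${msg.partition}`);
});
```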

That was my conclusion after looking at it today as well. Non-flowing mode works just fine in Change Prop, but the flowing mode's perpetual event loop seems to have a problem.

I would like to see option 2 enacted in the long term too, but given the deployment freeze next week, I would prefer to go with option 1 for now to minimise potential problems during the 4-week freeze period.

Change 327378 had a related patch set uploaded (by Ppchelko):
Temporarily switch to non-flowing mode.

https://gerrit.wikimedia.org/r/327378

OK @mobrovac, that makes sense. I've uploaded a patch that switches the service to non-flowing mode. In the meantime I will continue figuring out the bug in the driver.

Created an upstream issue: https://github.com/Blizzard/node-rdkafka/issues/75. I know a way to fix it upstream, but it's a bit hacky until we discuss and understand the root cause of this problem.

Change 327378 merged by Mobrovac:
Temporarily switch to non-flowing mode.

https://gerrit.wikimedia.org/r/327378

Mentioned in SAL (#wikimedia-operations) [2016-12-15T12:04:44Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@200a709]: Fix for T153122

Mentioned in SAL (#wikimedia-operations) [2016-12-15T12:05:08Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@200a709]: Fix for T153122 (duration: 00m 24s)

mobrovac changed the task status from Open to Stalled.Dec 15 2016, 12:26 PM
mobrovac assigned this task to Pchelolo.
mobrovac lowered the priority of this task from High to Medium.
mobrovac edited projects, added User-mobrovac; removed Patch-For-Review.

The fix did the trick! Thank you @Pchelolo! Let's follow up here once the upstream issue has been resolved.

I've fixed the upstream bug; let's wait for a new version to be published and then update the trending service.

Change 327859 had a related patch set uploaded (by Ppchelko):
Kafka consumer: Switch back to flowing mode.

https://gerrit.wikimedia.org/r/327859

Change 327859 merged by Mobrovac:
Kafka consumer: Switch back to flowing mode.

https://gerrit.wikimedia.org/r/327859

Mentioned in SAL (#wikimedia-operations) [2017-01-18T00:53:58Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@1d53b7c]: fixes for T153122 and T145571

Mentioned in SAL (#wikimedia-operations) [2017-01-18T00:59:04Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@1d53b7c]: fixes for T153122 and T145571 (duration: 05m 06s)